Usability Evaluation and Retrieval Performance of Google, AltaVista and Yahoo using TERS platform
A study submitted in partial fulfilment of the requirements for the degree of Master of Science
In Information Management
at
The University of Sheffield
by
Kessopoulou Eftychia
September 2005
Abstract
This research sets out to compare three popular search engines, namely Google,
AltaVista and Yahoo, by evaluating their retrieval performance and usability. The end
result sought is to determine which search engine best satisfies both the retrieval
performance and the usability criteria. The objectives of this study are a) to examine
whether the user is capable of successfully formulating queries and evaluating results,
by measuring effectivity, efficiency and user satisfaction, b) to examine which system is able to
retrieve relevant documents and rank more relevant documents higher in the list of
retrieved results, using the recall and precision measures, and c) to examine the correlation
between retrieval performance and usability testing.
In order to conduct the appropriate experiment, the researcher employed a
specialized platform called TERS (Testbed for the Evaluation of Retrieval Systems),
designed by Dr. Roelof van Zwol. Thirty (30) participants were recruited and a test
collection of thirty (30) topics was prepared. After each participant conducted fifteen
(15) different searching tasks, the researcher evaluated their searching results.
The conclusions of the study indicate that there is a conflict in results between
usability and retrieval performance. Thus, despite the fact that Google performed better
than the other search engines when usability was tested, Yahoo seemed to have a better
retrieval and ranking algorithm.
Table of Contents

1. Introduction 6
2. Literature Review 9
2.1 Information Retrieval Evaluation 9
2.2 Experimental Web Search Engine Evaluation Background 13
2.3 Criteria of Evaluation 20
3. Methodology 24
3.1 Introduction 24
3.2 Search Engine Evaluation Methodology 24
3.3 Data collection, types of data and data instruments 26
3.3.1 Data collection and types of data 26
3.3.2 Data instruments 26
a) Questionnaire 26
b) Observation checklist 27
c) Transaction logs 27
3.4 Experimental Design 28
3.4.1 Search Engine Features 28
3.4.2 Experimental Scenario 30
3.4.3 Test Environment 31
3.4.4 Participant Sample 31
3.4.5 Information needs (Topics) 31
3.5 Experimental Procedure 33
3.5.1 Ethics Approval 33
3.5.2 Pilot experiment 33
3.5.3 Configuration of TERS platform 34
a) General Specifications 34
i) Retrieval system identification 34
ii) Definition of the ‘topics’ 34
iii) Creation of user accounts 34
b) Usability experiment specifications 35
i) Survey questions set-up 35
ii) Participant workload in the usability experiment 35
c) Retrieval Performance experiment specifications 36
i) Participant workload in the retrieval performance experiment 36
ii) Run set-up 36
3.5.4 Presentation of TERS, Pre-session Questionnaires, Practice 36
3.5.5 Usability experiment section 37
3.5.6 Post-session 38
3.5.7 Judgment of Usability relevance assessments 38
3.5.8 Retrieval performance experiment section 38
3.6 Result Analysis 40
3.6.1 Retrieval Performance measures 40
a) Precision and Recall measures 40
3.6.2 Usability measures 40
a) Effectivity 40
b) Efficiency 41
c) Satisfaction 41
4 Results and analysis 42
4.1 Introduction 42
4.2 Participant’s profile 42
4.3 Retrieval performance experiment results and analysis 46
4.3.1 Recall - Precision results 46
a) Interpolated Recall-Precision 46
4.4 Usability experiment results 49
4.4.1 Overall analysis 49
4.4.2 Search engines usability testing results 51
4.4.3 Topics analysis results 54
4.4.4 User results analysis 57
5. Limitation of the experimental study 62
5.1 Search engines 62
5.2 Participants 63
5.3 Test collection 63
5.4 Observation data 63
5.5 Researcher’s interference 64
6. Discussion 65
7. Conclusion 66
References 68
Appendix A 73
Appendix B 76
Appendix C 86
Appendix D 87
Appendix E 93
1. Introduction
The World Wide Web (WWW) is a popular Internet application which first appeared in
1991. It has two major functions: the publishing of information by web authors and its
retrieval via the process of searching by web users. In the beginning, locating
information in online resources was not very complicated, as only a modest number of
resources was available, indexed in “an alphabetized listing of links to pages”
(Schwartz, 1998:974). But as time passed, the bulk of resources in this “hyperlinked
collection” (Kumar, 2005:3) increased enormously, and an information retrieval
system to find this information became essential. It was not until 1994 that
search engines came into sight and started developing.
A search engine mainly comprises a robot, a database and an agent. The
robot, or spider, is a program that crawls the WWW and “determines the quality and
quantity of information it accesses and retrieves for its database” (Scales & Felt quoted
by Dong and Su, 1997:69). The database is an indexed list of all the information that the
robot gathers. The agent is the intermediary between the user who requests the
information and the search engine’s database. When the user formulates a search query,
the agent searches the search engine’s database and retrieves the relevant documents for
the user, presenting them in a list with the most popular results appearing first. An
interface allows the web user to communicate with the retrieval system, giving him/her
the opportunity to formulate a query and receive a presentation of the results. Thus,
the more effective the search engine is, the higher it rates in users’ choices and
sponsors’ preferences, which are the basic source of revenue for web search engine
authors. Consequently, the comparison and evaluation of search engines is, according to
Dong and Su (1997:67), of “great importance for system developers and
information professionals, as well as end-users, for the improvement and development of
better tools”.
The evaluation of information retrieval systems has a history of 40 years and is
one of the most challenging fields in the information retrieval domain. Information retrieval,
according to Tague-Sutcliffe (1996), “is a process where sets of records are searched to
find items which may help to satisfy the information need or interest of an individual or
a group”. Evaluation of retrieval systems is concerned with the ability of a system to meet
the information needs of its users (Voorhees, 2002).
There are two major approaches to the evaluation of information retrieval
system performance. One is the traditional system-oriented approach, which focuses
on “improving the match between the query and the document as well as the
specifications of the computationally effective and efficient retrieval algorithms” (Pors,
2000). Thus, in order to evaluate how well a system can rank relevant documents,
precision and recall measures, or other variations of these notions, were used exclusively
by researchers. The other approach is user-centered and takes into account the
interactivity between the user and the information retrieval system, with respect to user
satisfaction. Because this perspective is relatively new, the measures used are
still in doubt. Several measures have been employed in many studies concerning the
comparison and evaluation of search engines from this point of view. According to Su
(1992:503), though, “there is a lack of agreement of which are the best existing
evaluation measures”.
In previous years both approaches were applied in IR experimental
environments for the performance evaluation of search engines, but not
simultaneously. Each perspective produced results measuring only the variables from
its point of view, and there was no attempt at any correlation between the two approaches,
as there was no experiment combining them. In 2004, van Zwol (2004)
devised an evaluation platform that makes it possible to conduct both
usability and retrieval performance evaluation studies simultaneously. This motivates the
research topic, which is a combination of both evaluation perspectives using Dr. Roelof
van Zwol’s platform, the so-called TERS (Testbed for the Evaluation of Retrieval Systems). This tool
will be used for the comparison of three popular search engines, namely Google,
AltaVista and Yahoo.
In brief, the objectives of this experiment are:
• to examine whether the user is capable of successfully formulating queries and
evaluating results, by measuring effectivity, efficiency and user satisfaction
• to examine which system is able to retrieve relevant documents and rank more
relevant documents higher in the list of retrieved results, using the recall and
precision measures
• to examine the correlation between retrieval performance and usability testing.
The remainder of this research comprises: a) a literature review of relevant experimental
studies, b) an introduction to the methodology applied, c) the experimental design, d) the
report and analysis of the results, e) the limitations of the experimental study, f) the
discussion and g) the conclusion.
2. Literature Review
This section is dedicated to the review of the two major perspectives, system-
and user-oriented, that are used for information retrieval evaluation and subsequently for
evaluation of web search engines. The following paragraphs include an analysis of the
aforementioned methodological approaches, an overview of the criteria used, as well as
search-engine-related comparative studies that have taken place in the past.
2.1 Information Retrieval Evaluation
In information retrieval (IR), evaluation is an important notion that requires
further development. Van House et al., quoted by Dong and Su (1997:68), provide a
rather wide definition of the word evaluation as “the process of identifying and
collecting data about specific services or activities, establishing criteria by which their
success can be assessed and determining both the quality of the service or the activity
and the degree to which the service or activity accomplishes stated goals and
objectives”.
The fundamental goal of an information retrieval system is to locate relevant
documents with respect to users’ requests. Thus, the evaluation of an IR system focuses
on two major aspects: to improve the retrieval performance of the system and to
maximize the user satisfaction during the process of retrieval.
IR evaluation started in the 1950s at the Cranfield College of Aeronautics in
England, and it was basically concerned with the measurement of the retrieval
performance of a system, namely the system-centered approach. During the 1950s, the
idea that “the system knows better” ruled. The evaluation of an IR system was based on
improving the match between the query and the documents retrieved from the
system’s database, by developing effective and efficient algorithms. The typical example
of this approach is the Cranfield paradigm.
The system-oriented methodology is in general grounded on an experimental basis,
taking place in a laboratory environment. According to Voorhees (2002:355), the choice
of location depended on the researchers’ ability to easily “control some of the
variables that affect the performance of the systems and thus increase the power of the
comparative experiments”. The experimental scenario involved “a collection of
documents, a set of example information requests and a set of relevant documents for
each example information request” (Baeza-Yates, 1999:74). The ‘information requests’
known also as ‘queries’ were “considered to be representations of the information
needs” (Pors, 2000:60) of the user. The ‘set of relevant documents for each example
information request’ known also as ‘relevance judgments’ were not made by the user,
but instead were provided by the experimenters and were “objective and static, which is
the reason for their often-binary role” (Pors, 2000:60).
According to Pors (2000), criticism of this approach is based on the set of
documents, the queries and the concept of relevance. He states that traditional
experiments are based on a test collection with a relatively small sample size, as it is
based on only one type of document. For instance, the Cranfield paradigm was based
only on aeronautic topics. Moreover, Pors expresses his doubts that the queries represent
real information needs, and he maintains that the “methodology of relevance might be
skewed” (Pors, 2000:62).
This traditional approach is too narrow to be adopted for the evaluation of IR
systems because it does not take into account the variable of interaction. Thus,
researchers during the 1990s paid much more attention to the user-centered approach,
which defines the IR system more broadly, even though it provides less “diagnostic
information regarding the system behavior” (Voorhees, 2002:355). According to Pors
(2000), the interaction between the mechanisms of information retrieval (database, user,
representation language, IR algorithm, interface, user language and requests) is of great
interest to information retrieval research.
The point of the user-oriented evaluation approach is to investigate “how well the
user, the retrieval mechanisms and the database interact in extracting information under
real life situations” (Borlund, 2000:74). Thus, according to Pors (2000), this approach
has three dimensions: relevance, cognition and process orientation.
Concerning relevance, the user-oriented approach escapes and overcomes the
narrow assumption adopted in the system-centered approach, which equates the stated
request with the information need. Robertson and Beaulieu (1992:458) suggest that
relevance should be judged according to users’ real information needs. Furthermore,
some “self-reported mental activities or external observations” should be employed in
order to approach this concept. Schamber, quoted in Borlund (2000:72), investigated the
elements that influence users’ relevance judgments and constructed a list of
eight variables. Moreover, the findings suggest that relevance is a
multi-dimensional concept and that static binary assumptions are not acceptable.
From a cognitive viewpoint, during the process of information searching and
retrieval the user is regarded as having a “certain anomalous state of knowledge” which
indicates his/her information need (Belkin, quoted by Robertson and Beaulieu,
1992:459). According to Borlund (2000:72), “this means that an information need, from
the users’ perspective, is a personal and individual perception of a given information
requirement and that an information need for the same user can change over time”.
In the process-oriented dimension, information retrieval is regarded as an
interactive process. The user has a problem to solve (information need) and therefore
he/she tries several mechanisms as his/her state of knowledge develops or changes.
Thus, the process-oriented perspective focuses on the whole process of information
searching. Moreover, it tries to analyze the user’s behavior and find the elements that
affect it.
Evidently, this modern user-oriented approach is completely different from the
static ‘black box’ perspective applied in the Cranfield paradigm, where the user was of
no importance. The experiments adopting the user-oriented approach are mainly
operational, and they differ from a laboratory setting in the lesser control applied over the
variables. Operational experiments are closer to real life and include users in the
evaluation of IR systems. They are more comprehensive, even though the difficulty of
measuring qualitative variables, as well as the cost of such experimentation, is higher
compared to the traditional approach. This is the main reason that the system-
centered approach has not been abandoned but is still used in many experiments.
Moreover, the user-centered approach is not so easy to conduct. Jones and
Willett, quoted by Voorhees (2002:355), mention that “a properly designed user-based
evaluation must use a sufficiently large, representative sample of actual users of the
retrieval system; each of the systems to be compared must be equally well developed
and complete with an appropriate user interface; each subject must be equally well
trained on all systems and care must be taken to control for the learning effect”.
Saracevic (1995) supported early on the idea of using both methodological
approaches instead of adopting only one. He considers information retrieval evaluation
an amalgam of “a system, a criterion or criteria, measures, measuring instruments and
methodology” (Saracevic, 1995:142). Moreover, he divides evaluation into six categories:
“engineering (dealing with hardware and software), input (investigating inputs and
contents of the system), processing (how the inputs are processed), output (interactions
with the system), use and user (questions of applications and given problems and tasks
raised) and social (issues of environmental impact)” (Saracevic, 1995:140-141). He
locates the system-centered approach at the processing level, and he expresses the
need to focus evaluation on all six aforementioned categories. In his paper, he
severely criticizes Dervin and Nilan, who suggested a methodological change from
system-based to user-oriented evaluation.
Borlund (2000) accepts the importance of both approaches for the evaluation of IR
systems and concurs with the suggestion of Beaulieu et al. (1996) that there is a need to
“simulate a realistic interactive searching task within a laboratory environment”. She
proposes an experimental evaluation of IR systems that combines both “realism and
control” and consists of three basic elements:
• the involvement of potential users as test persons
• the application of dynamic and individual information needs
• the use of multidimensional and dynamic relevance judgments (Borlund,
2000:76)
It is evident that both system-centered and user-oriented approaches
need to be taken into account in order to provide a broader framework for the evaluation
of retrieval systems. Especially, given the current enormous expansion of the World Wide
Web, search engines, as specific, representative and interactive information retrieval
systems, are treated with great concern.
2.2 Experimental Web Search Engine Evaluation Background
Before reviewing modern experiments based on search engine evaluation, it is
worth referring to the cornerstone of IR evaluation experiments. This is the well-known
Cranfield paradigm, and more specifically the second run of tests conducted by
Cleverdon et al. in the 1960s, which is a typical example of a laboratory experiment.
The researchers investigated which of 33 indexing languages performed
better with respect to retrieval. The experimental scenario involved a ‘test collection’
comprised of 1400 documents relating to aeronautical engineering, a set of 331 ‘queries’
representing user needs and a set of ‘relevance judgments’, “where each document was
judged relevant or not relevant to each query” (Harter and Hert, 1997:8). Relevance
was based on topical similarity, and the judgments were made by domain experts.
Cleverdon et al. devised the measures of recall, precision and fallout in order to measure
the effectiveness for each indexing language.
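These three measures can be expressed in simple set terms. The sketch below is illustrative only; the function and variable names are mine, not part of the Cranfield tests:

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def fallout(retrieved, relevant, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant)

# Toy collection of 10 documents, 4 of them relevant to a query
collection = set(range(10))
relevant = {0, 1, 2, 3}
retrieved = {0, 1, 4, 5}

print(recall(retrieved, relevant))     # 2/4 = 0.5
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(round(fallout(retrieved, relevant, collection), 3))  # 2/6 ≈ 0.333
```

A perfect system would score 1.0 on recall and precision and 0.0 on fallout; in practice the measures trade off against one another.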
These traditional measures, and especially recall and precision, received a lot of criticism.
Voorhees (2002:356), criticizing three hypotheses on which the Cranfield paradigm is
based, expresses concern:
• Relevance can be approximated by topical similarity. Thus, all relevant
documents are equally desirable, the relevance of one document is independent
of the relevance of any other document, and the user information need is static
• A single set of judgments for a topic is representative of the user population
• The list of relevant documents for each topic is complete
TREC experiments bear many similarities to the Cranfield model but move a step
closer to interactive information retrieval with the adoption of the ‘interactive track’ in
TREC. TREC experiments use a larger test collection, taking into account that the size
of the collection plays an important role in IR performance evaluation. The methodology
adopted is based on the system-driven approach, and the queries used, as in the Cranfield
model, are formed in a standard way. On the other hand, the relevance judgments are
“created using the pooling method” (Harter and Hert, 1997:25), but relevance is
measured in the same way as in the Cranfield model, using the binary scale of
relevant and non-relevant.
Borlund (2000) criticizes many aspects of the interactive TREC. She argues that
the queries, namely the topics used, are “too static, limited and unrealistic”, and
consequently the test persons are unable to develop their own interpretations of an
information request. Moreover, she comments more generally on the concept of
interaction, stating that insofar as TREC adopts a system-oriented approach, the
notion of interaction is disregarded. She concludes
that TREC is not adequate for the evaluation of IIR systems because it
adopts the framework of IR system evaluation.
Although TREC experiments have received much criticism of the methodology
and criteria adopted and of their unsuitability for evaluating interactive information retrieval
systems, they remain a seriously influential example for later interactive information
retrieval evaluation experiments.
Chu and Rosenthal in 1996 conducted an experiment to evaluate the performance of
three web search engines: AltaVista, Excite and Lycos. They had a sample of
search queries adequate for statistical analysis, constructed using variations of query
syntax. The reason was their interest in studying the search engines’ ability to handle
different types of queries. For instance, some of the queries were single words
while others required the use of Boolean operators (Chu and Rosenthal, 1996). The
evaluation criteria that they used were those proposed by Lancaster and Fayen (1973).
Thus they measured coverage, precision, response time, user effort and form of output
but they omitted recall because they considered its measurement problematic in an
environment such as the World Wide Web. They grounded their opinion on the fact that
the number of relevant documents on the World Wide Web is indefinable regarding the
dynamic and unstable nature of this environment. As far as precision is concerned, they
calculated it taking into account the first ten hits after assigning score (1) for highly
relevant documents, (0.5) for fairly relevant and (0) for irrelevant. Leighton and
Srivastava (1997) criticize the fact that they studied only three search engines and
did not use any significance test for their precision means. Moreover, the relevance
judgments were made by the researchers themselves, which may have introduced
bias.
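Their graded first-ten precision can be sketched as follows (an illustrative reconstruction; the names and example judgments are mine, not Chu and Rosenthal's):

```python
def graded_precision(scores, cutoff=10):
    """Mean relevance score over the first `cutoff` hits, where each hit
    scores 1 (highly relevant), 0.5 (fairly relevant) or 0 (irrelevant)."""
    top = scores[:cutoff]
    return sum(top) / len(top)

# Ten hits with hypothetical relevance judgments
hits = [1, 1, 0.5, 0, 1, 0.5, 0, 0, 1, 0.5]
print(graded_precision(hits))  # 5.5 / 10 = 0.55
```

With binary judgments this reduces to ordinary precision at ten; the 0.5 score simply lets fairly relevant pages contribute half credit.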
Ding and Marchionini (1996) evaluated the performance of three search engines,
taking into account the impact of hyperlinks. Since the World Wide Web is a
hyperlinked environment, the researchers considered it appropriate to include in
their study “not only primary resources but also secondary” (Ding and Marchionini,
1996:139), because links to other relevant documents are of great importance to search
engine evaluation.
Ding and Marchionini used a small test suite of five queries which, according to
Leighton and Srivastava (1997), is statistically insignificant. Moreover, the latter
(1997:4) comment on the queries as “narrowly defined academic topics with the
use of multiple words”. The researchers of the study also accepted that the query
analysis and formulation, as well as the relevance judgments, were subjective.
As far as the evaluation criteria are concerned, Ding and Marchionini found it
interesting to measure: precision of the first twenty hits, duplication in the retrieved sets,
validation of links and the degree of overlap. The measures used for these criteria are
precision and recall in three types, salience and relevance concentration. Dong (1999)
comments positively on the precision measured in this study as being based on the
“standardized formula” (Dong, 1999:148), and Gordon and Pathak regard this
comparative evaluation as performing sufficient statistical tests for comparing search
engines.
Clarke and Willett (1997) conducted a study to compare the effectiveness of three
search engines using thirty queries based on library and information science topics. The
researchers calculated first ten precision and developed an algorithm to measure
approximate recall. They adopted Chu and Rosenthal relevance criteria and they
assigned (1) for relevant documents retrieved, (0.5) for partially relevant, and (0) for
irrelevant. Additionally, the authors extended the concept of relevance by assigning (0.5)
for a page that led to one or more relevant pages (taking into account the effect of
hyperlinking) and (0) for duplicate sites. According to Oppenheim et al. (2000:197), the
study of Clarke and Willett is essential both because “of the critical evaluation of earlier
research that it provides and because it offers a realistic and achievable methodology for
evaluating search engines”.
According to Nicholson, Tomaiuolo and Parker conducted the “first large scale”
study. They evaluated four search engines and two evaluative search tools, which had
the ability to review and evaluate the web pages that they index. For their study they
used two hundred query topics that they gathered from undergraduates. The precision
was calculated for the first ten hits, and the relevance judgments were made by the
researchers themselves. Gordon and Pathak (1999) mention that the relevance judgments
were frequently based on “the short summary descriptions of the web pages that search
engines provide”. Moreover, Leighton and Srivastava (1997:4) state that the
researchers did not consistently check whether all visited links were active, and they
criticize Tomaiuolo and Parker for their lack of defined criteria for relevance.
Hawking et al. (1999) compared five search engines popular at that time, using
a TREC-like methodology. The 54 queries used were gathered from
transaction logs by AltaVista and Electric Monk. They used scripts to
present each of the queries to each search engine, and they calculated precision on the
first twenty hits. The relevance judgments were made by four judges, each of whom
judged all the documents retrieved for one query, using binary relevance. According to
Su (2003:1177), the methodology applied by Hawking et al. excels in providing
“reproducible results and blind testing”. Moreover, they did not take into account
inactive links when measuring precision.
Leighton and Srivastava (1999) compared five web search engines based on first-twenty
precision, assigning different weights to the retrieved results. They divided these twenty
hits into three groups, weighting differently the first three, next seven and last ten of the
twenty hits, though the weighting was arbitrary. Moreover, they used four categories
of relevance, taking into account inactive and duplicate links and penalizing them.
They used fifteen queries, of which ten were real reference questions and the remaining
five were taken from a previous study conducted by them. Evaluators different from the
users made the relevance judgments, and they were unaware of which search engine
returned the results judged; this was done to minimize any bias. They used statistical
techniques for non-normally distributed data and analyzed the
data several times using different relevance assessments.
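Such group-weighted precision over the first twenty hits can be sketched as follows (the 3/2/1 weights are illustrative placeholders, since the original weighting was, as noted, arbitrary):

```python
def weighted_precision(scores, groups=((3, 3.0), (7, 2.0), (10, 1.0))):
    """Score the first 3, next 7 and last 10 of twenty hits with
    decreasing weights, normalised by the maximum attainable score."""
    total = maximum = 0.0
    start = 0
    for size, weight in groups:
        total += weight * sum(scores[start:start + size])
        maximum += weight * size
        start += size
    return total / maximum

# Twenty hits: 1 = relevant, 0 = not (inactive/duplicate links score 0)
hits = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(round(weighted_precision(hits), 3))  # 12/33 ≈ 0.364
```

The normalisation ensures the score stays in [0, 1], so an engine that concentrates its relevant hits in the first three positions scores higher than one with the same hits scattered lower down.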
User-oriented experiments
User-oriented experimental approaches, conducted in later years in an attempt to
form some kind of common methodology, involved real users.
One of the most up-to-date experimental studies is that of Gordon and Pathak
(1999). In their experiment they compared the effectiveness of eight web search engines.
They used thirty-three real information requests, which they gathered from students of a
university business school. The information requests were thus based on one type of
document, limiting the scope of evaluation. Once they had gathered the requests,
they employed expert searchers to conduct the searches on behalf of the users. Using
expert searchers other than the user “seems somewhat inconsistent” (Hawking
et al. 2001:37) with the requirement that relevance judgments be made by the
person who originally has the information need. During the search task, the search
intermediaries conducted the search repeatedly and interactively. They were instructed
to apply query optimization and accept the best results retrieved from each search engine
for a topic before saving them. The idea of “near-optimal” queries, according to
Hawking (2001), is interesting, especially when they provide results of better
performance in comparison to simple queries. But Gordon and Pathak did not make any
such comparisons. The top twenty hits for each search engine, with respect to the
queries submitted, were printed and sent to the users to judge their relevance on a
four-point relevance scale. Since the users were provided only with the results,
‘blind’ judging was used to minimize bias. Moreover, they calculated recall, precision
and overlap at various cutoff levels. They also used statistical tests to make their results
more reliable.
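The overlap between two engines' ranked lists at a given cutoff can be sketched as follows (an illustration only; Gordon and Pathak's exact formulation may differ):

```python
def overlap_at(results_a, results_b, cutoff):
    """Share of documents that two ranked result lists have in common
    within their first `cutoff` hits."""
    top_a = set(results_a[:cutoff])
    top_b = set(results_b[:cutoff])
    return len(top_a & top_b) / cutoff

# Hypothetical top-five result lists from two engines
engine_a = ["d1", "d2", "d3", "d4", "d5"]
engine_b = ["d3", "d9", "d1", "d7", "d8"]
print(overlap_at(engine_a, engine_b, 5))  # 2 shared of 5 = 0.4
```

Low overlap between engines was itself a notable finding of such studies, since it implies that no single engine covers the whole relevant portion of the Web.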
Su (1997) proposed a methodology which was adopted by Su, Chen and Dong
(1998). The researchers presented a pilot study evaluation of four popular search engines
using real users who searched for their personal information needs. The users also
judged the relevance, the features of the search engines, interaction, search results and
overall performance. The criteria employed in their study are relevance, efficiency, user
satisfaction, utility and connectivity. Later on, Su and Chen (1999) performed the same
experiment using a larger test suite, based on the same methodology proposed by Su
(1997) but with some differences. The most important aspect of this experiment is that
each participant searched the same topic in each of the four search engines. The purpose
was to minimize the between-subject impact (Su, 2003).
Simulated experiments
Up until now, according to Roelof van Zwol (2004), there has been no experimental
attempt to combine the two approaches. Borlund (2000) proposed an experimental
setting combining both the system-centered and user-centered perspectives. She built on
Borlund and Ingwersen’s (1997) earlier methodological approach, which introduced the
concept of the “simulated work task situation”. Borlund investigated the possibility of
potential differences between real information needs and simulated ones. Based on the
empirical findings of a meta-evaluation, she concluded that “simulated work task
situations” are suitable for IIR evaluation.
Roelof van Zwol (2004) conducted an experimental study combining the two major
approaches using a purpose-built platform, namely TERS. For the user-centered
approach, widely known as a usability study, he employed the measures of effectivity,
efficiency and satisfaction. The measure of effectivity illustrates the ability of the users
to find relevant information. Efficiency concerns the time that users need to
successfully complete a topic. Satisfaction measures the degree of satisfaction of the
users with respect to the information retrieval interface. On the other hand, in order to
measure the retrieval performance of the search engines, he used the traditional recall
and precision measures. The experimental scenario involved twenty users, who participated
in both experiments (usability and retrieval performance). The participants were divided
in three categories: beginners, novices and experts. The queries used numbered 37 and
they were divided in 18 specific topics and 19 more generic. The researcher was based
on Rouet’s (2003) criteria for the construction of the topics, in which he applied
statistical tests for differences between the two categories of topics. Moreover, he
19
calculated precision in different cutoff levels in order for his findings to be more
confident.
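The three usability measures described above reduce to simple averages over per-task session data. The following is a minimal sketch, not taken from the study itself: the record structure and the field names `correct`, `seconds` and `satisfaction` are invented for illustration.

```python
# Illustrative sketch: computing effectivity, efficiency and satisfaction
# from hypothetical per-task session records (field names are invented).

def usability_measures(records):
    """records: list of dicts, one per completed search task."""
    n = len(records)
    effectivity = sum(r["correct"] for r in records) / n        # share of tasks answered correctly
    efficiency = sum(r["seconds"] for r in records) / n         # mean time to complete a task
    satisfaction = sum(r["satisfaction"] for r in records) / n  # mean questionnaire rating (1-7)
    return effectivity, efficiency, satisfaction

records = [
    {"correct": 1, "seconds": 120, "satisfaction": 6},
    {"correct": 0, "seconds": 300, "satisfaction": 3},
    {"correct": 1, "seconds": 180, "satisfaction": 5},
]
eff, secs, sat = usability_measures(records)
```

In this toy example the measures would be aggregated per search engine, so that the three systems can be compared on each dimension separately.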
2.3 Criteria of Evaluation
‘Batch’ evaluation experiments applying the recall and precision measures were
used exclusively in the Cranfield paradigm. These standard measures have attracted both
praise and objections in the information retrieval field with regard to evaluation.
Hersh (2000:17) is in favor of these measures and argues that “this is an effective and
realistic approach to determining the systems performance”. On the other hand,
objections to this view have been raised concerning the deficiencies of these
measures, the questionable validity of recall, the practical drawbacks in measuring recall,
the presence of interactivity, and the ambiguous definition of relevance.
First of all, Large, Tedd and Hartley (2001) characterize these measures as
incomplete when used to evaluate information retrieval systems. They introduce further
factors such as “expense”, “time” and “the ease of conducting the search via the
system interface”. In addition, Lancaster and Warner, quoted by Large, Tedd and
Hartley (2001:282), argue that “accessibility” and “ease of use” are the most important
criteria for the user in choosing a source from which to retrieve information.
Moreover, the validity of the recall measure is controversial. The Cranfield tests
were based on the hypothesis that a user is in favor of “finding as many relevant records
as possible” (Large, Tedd and Hartley, 2001:282). This is an arbitrary assumption that
does not necessarily describe average user behavior. There are many cases where the
user is not interested in retrieving the greatest possible number of records but simply
in obtaining a specific record. When this is the case, recall is useless and precision
alone can serve for the evaluation of the retrieval system.
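For reference, the two measures can be stated concretely as set operations over the retrieved and relevant document sets. The sketch below uses invented document identifiers:

```python
# Illustrative sketch: precision and recall over sets of document identifiers.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d2", "d4", "d7"])
```

Here precision is 0.5 (two of four retrieved documents are relevant) and recall is 2/3 (two of three relevant documents were found), which makes concrete why the two measures can diverge.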
Furthermore, another criticism of recall lies in its measurement when
applied to large databases or interactive retrieval environments such as the World
Wide Web (Large, Tedd and Hartley, 2001). The main reason is that in these settings the
number of available records cannot be gauged, only estimated. It is widely
recognized that every day numerous new web pages come to light while others are
removed. Counting all the records found in these pages would therefore require
practically infinite time and effort. Recall can only be estimated approximately, even
when enhanced with additional characteristics such as “absolute retrieval” (Large, Tedd
and Hartley, 2001). On the other hand, this measurement is perhaps not crucial for
measuring the performance of interactive retrieval systems, as Ralph, quoted by Jones
and Willett (1997), suggests.
Additionally, another drawback of these measures concerns the concept of
relevance. Two main issues are raised in this area, relating to the subjectivity and the
judgment of relevance. Jones and Willett (1997:168), quoting Cooper, concede that
“the property sought of retrieved documents is subjective utility to the
user rather than topic relevance, which is objective property of documents in
themselves”. Thus, the searcher determines what is relevant or not with reference to the
information enquiry posed. As for the second issue, the user’s opinion about a record
may be biased by other records scanned earlier. Despite all this, Large, Tedd and
Hartley (2001) conclude that the measures of recall and precision remain valid even if
it is difficult to estimate them exactly within experimental environments.
Moving further, interactivity constitutes another challenge for the recall
and precision measures. These measures are naturally suited to static
environments such as the Cranfield experiments. Nowadays, however, there is a great
deal of interaction between the system and the user, and evaluation methods that leave
the user out tend to be pushed out of the limelight (Beaulieu, quoted by Jones and
Willett, 1997). Consequently, the need for a broader evaluation approach cannot be
disregarded. This is the only way to simulate real circumstances, because ‘batch’
searching results give misleading evaluations and thus lead to poorly designed
retrieval systems.
In the light of all the above, the evaluation of retrieval systems has turned
towards the users as well, and new measures have been added for usability evaluation.
The most representative measures are those proposed by ISO’s broad definition (Frokjaer,
Hertzum and Hornbæk, 2000): “effectiveness”, “efficiency” and “user satisfaction”.
Frokjaer, Hertzum and Hornbæk (2000) emphasize the importance of accounting for all
three measures. They conducted an experiment which found no correlation between the
three terms, and concluded that all of them need to be calculated in order to measure
usability. Their results contradicted previous studies that “refrain from accounting for
effectiveness and settle for measures of the efficiency of the interaction process”
(Frokjaer, Hertzum and Hornbæk, 2000:346).
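The “no correlation” finding can be checked with a standard statistic such as Pearson’s r, computed across users for each pair of measures. The sketch below is illustrative only: the per-user sample values are invented, not drawn from the cited study.

```python
# Illustrative sketch: Pearson correlation between two usability measures
# across users. Sample values are invented for the example.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-user effectiveness (share of tasks solved) vs. efficiency (mean seconds per task):
effectiveness = [0.9, 0.6, 0.8, 0.5, 0.7]
seconds = [150, 140, 155, 160, 145]
r = pearson_r(effectiveness, seconds)  # a value near zero would support "no correlation"
```

A value of r near zero for every pair of measures is what would justify reporting all three separately rather than substituting one for another.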
On the other hand, even if evaluating interactive systems through usability is
an up-to-date approach, it faces important challenges that need to be taken into
consideration. The most serious of them is the interference of the experimenter
(Jones and Willett, 1997). For example, when the investigator asks questions about what
the users are doing, the latter often become disoriented, especially if they are
unfamiliar with the procedure. Moreover, the subjectivity of users repeatedly resurfaces,
since each user values the searches according to different characteristics. A further
issue is “nonreplicability”: Jones and Willett (1997:171) note that “once a user has done a
search seeking to satisfy some information need with some particular system or utility,
they cannot by definition, do another search for the same need with a different system”.
They attribute this to the knowledge gained previously, and argue that in-depth
experiments therefore require large samples of individuals.
Recently, another important notion that has raised debate in the field of
information retrieval evaluation concerns the correlation between the two kinds of
experiment. In his paper, Hersh (2001) explains why retrieval performance and usability
studies do not give the same results, based on an experiment he conducted. One point he
makes is that users browse down the list of documents that the system retrieves. By
contrast, Roelof van Zwol (2004:388) argues, with regard to his own experiment, that
“users are very well capable of finding the relevant information within the top ranked
documents. If they are not then they just refine their search rather than browse through
lower ranked documents”. In general, however, Roelof van Zwol (2004) accepts the
finding of Hersh (2000) that there is a disconnection between the traditional measures of
recall and precision and user satisfaction.

1 “Effectiveness is the accuracy and completeness with which users achieve certain goals. Indicators of effectiveness include quality of solution and error rates” (Frokjaer, Hertzum and Hornbæk, 2000:345)
2 “Efficiency is the relation between the accuracy and completeness with which users achieve certain goals and the resources expended in achieving them. Indicators of efficiency include task completion time and learning time” (Frokjaer, Hertzum and Hornbæk, 2000:345)
3 “Satisfaction is the users’ comfort with and positive attitudes towards the use of the system. It can be measured by attitude rating scales” (Frokjaer, Hertzum and Hornbæk, 2000:345)
3. Methodology
3.1 Introduction
This chapter describes the search engine evaluation methodology adopted by the
researcher, as well as the methods of data collection and the data types and
instruments employed for this purpose.
3.2 Search Engine Evaluation Methodology
Taking into account the dynamic nature of search engines and the objectives
of this experiment, the two methodologies, system-centered and user-oriented, were
combined for the search engine evaluation. The system-centered approach is used
exclusively for the retrieval performance of the web search engines. This approach makes
it possible to compare the algorithms of different search engines in an attempt to
find out which of them can provide the most relevant documents. Thus, a collection of
documents that represent the information problem, a set of example information requests
(or, more simply, in search engine terms, a set of queries) and a set of relevant documents
for each example query are regarded as adequate ingredients for measuring information
retrieval performance.
This had been the dominant approach until the 1990s, when a new approach
undermined the utility of the traditional system-centered one and moved closer to
the user than to the system. This new approach is diversified. From one viewpoint it tries to
identify the notion of relevance. From another perspective it is concerned with user
cognition. From a third dimension it emphasizes process orientation, which is attributed
to the interaction between the user and the system.
Experimental studies, though, continued to apply the traditional approach because
it is much simpler than the new one, which requires the presence of real users. The
reason is that experimental studies, owing to their laboratory character, require control
over some of the variables that are tested, while user-centered approaches are operational
in nature and involve qualitative data that are difficult to measure. Moreover, the cost of
the latter, as a consequence of its nature, is considerably higher in comparison with the
laboratory approach.
A combination of both approaches, one measuring usability and the other
measuring retrieval performance, was not attempted until 2004, when Dr. Roelof van
Zwol built an experimental testbed to facilitate the conduct of both experiments in an
attempt to evaluate information retrieval systems from a complete perspective.
This research is based on this idea and, by using the testbed (TERS), will attempt
to evaluate the retrieval performance and the usability of three web search engines.
Additionally, this experimental study will investigate further the correlation between the
two experiments. Thus, different types of users were recruited to take part in an
experiment which combined realism and control.
The researcher used both quantitative and qualitative methods for the purposes of
this experiment, in order to gather data about the user-system interaction as well as more
system-oriented data. Thus, the testbed was configured, and questionnaire and
observation methods were used. Both quantitative and qualitative data helped to answer
the research questions about the users’ ability to successfully formulate queries and
evaluate results, while only quantitative data helped to answer the research questions
concerning the systems’ ability to retrieve relevant documents and to rank more relevant
documents higher in the list of retrieved documents.
3.3 Data collection, types of data and data instruments
3.3.1 Data collection and types of data
In order to extract information about the system, the user and their interaction,
both quantitative and qualitative data collection methods were used. Quantitative data
helped to describe the participants’ profiles and to measure the retrieval performance of
the search engines. Qualitative data, on the other hand, focused on the interaction between
the system and the user. More specifically, the types of data used are questionnaire data,
transaction log data and observation data.
3.3.2 Data instruments
a) Questionnaire
Two kinds of questionnaire were filled in by the participants: pre-session and
post-session. The pre-session questionnaire was divided into two parts, because its first
part could not be accommodated on the TERS platform. Thus, the first part of the
pre-session questionnaire (Pre-session Questionnaire I) took a paper-and-pencil form and
was constructed using checklists. This questionnaire was used to gather data on the
participants’ characteristics, experiences and preferences. The second part of the
questionnaire (Pre-session Questionnaire II) was uploaded to the TERS platform and was
displayed before the participants started the experiment. This questionnaire included a
seven-point Likert scale and was used to map the participants’ searching experiences and
skills when using web search engines to satisfy their information needs. The post-session
questionnaire, which was also uploaded to the platform, was filled in after the
participants finished the usability experiment. This questionnaire gave the users the
ability to judge the usability of the search engines in terms of search options, result
presentation, relevance, response time and the overall satisfaction that the participants
get when using them. Using the post-session questionnaire, the researcher was able to
map the attitudes of the participants towards each web search engine. Three types of
questions were included in this questionnaire: 7-point Likert-scaled questions, Boolean
questions (a checklist that gives two options, from which the user has to check one or
the other), and open questions (which were optional).
b) Observation checklist
In the quantitative part of this project, the researcher used observation, even
though it is regarded as a qualitative method. The purpose of using observation in this
way was to focus on a specific aspect of the participants’ behavior: the searching
strategy followed when using web search engines. An observation checklist was therefore
constructed by the researcher for this purpose. The checklist included general searching
strategy characteristics that participants used in order to satisfy their information needs,
which were not easy to gather with other data instruments. The checklist helped to
quantify the behavior of the users to some extent.
c) Transaction logs
“Every SQL Server database has at least two files associated with it: one data file
that houses the actual data and one transaction log file. The transaction log is a
fundamental component of a database management system. All changes to application
data in the database are recorded serially in the transaction log. Using this information,
the DBMS can track which transaction made which changes to SQL Server data”
(http://www.dbazine.com/sql/sql-articles/mullins-sqlserver). Thus, with the help of
transaction logs the researcher was able to gather data regarding the searches and the
relevance judgments provided by users. All the data were saved in the TERS platform,
to which the researcher had access via the internet.
3.4 Experimental Design
3.4.1 Search Engine Features
Because search engine companies do not reveal to the public the exact
algorithms they use for indexing and ranking, the researcher provides only information
gathered from the three search engines’ web sites and from personal experience.
i) Google
The Google search engine supports both a simple and an advanced search mode,
each with its own interface. The simple interface (Appendix A) has a single box where
the user can type the query terms. There is the option of either displaying a list of
results or limiting the result to just one, the most relevant that Google finds. The
advanced mode (Appendix A) provides options for finding pages that include all the
search terms, an exact phrase, at least one of the words, or none of the words; that are
written in a certain language; created in a certain file format; updated within a certain
period; containing numbers within a certain range; within a certain domain or website;
or that do not contain “adult” material. Google ignores common words and characters
such as “where” and “how”, as well as certain single digits and single letters, which slow
down the search without improving the results. Moreover, it is not case sensitive, it
automatically adds the “AND” operator between query terms, it uses stemming
technology, and it does not support truncation. (www.google.com)
ii) AltaVista
The AltaVista search engine offers a simple and an advanced search mode
(Appendix A). The simple search mode provides a single box where the user can type a
query. The advanced search mode makes it possible to build queries that include all the
words, an exact phrase, any of the words, or none of the words, and to return results in
any of 36 languages, within a specific time range, or from specific locations or domains.
It also provides the option of displaying either a maximum of two results per site or all
the results, including results from the same site. Moreover, there is an additional
free-form Boolean query box where expert users can build a query using Boolean
operators. Additionally, AltaVista indexes all of the words on each web page, offers the
chance of translating a web page, does not support truncation, and is case sensitive.
(www.altavista.com)
iii) Yahoo
The Yahoo search engine provides two types of search, simple and advanced
(Appendix A). Simple search is the first option here, as in the previous search engines.
The advanced search mode gives the user the chance to limit the search to results with
all of the words, at least one of the words, an exact phrase, or none of the words; that
have been recently updated; from a certain site or domain; containing “adult” content;
from a specific country; in a specific language; with a specific number of results per
page; related to a particular site; and in a specific file format. Additionally, it supports
truncation and is case sensitive. (http://search.yahoo.com/)
3.4.2 Experimental Scenario
The experimental scenario involves the usability and the retrieval performance
experiments using the TERS platform. Three popular search engines were used for
comparison: Google, AltaVista and Yahoo. Thirty (30) volunteers were recruited as
testing searchers and relevance judges, which means that they participated in both
experiments. The number of topics (information needs) was also thirty. In the
usability experiment, fifteen (15) topics were assigned to each participant. Each
participant searched these topics in the search engines previously set by the researcher;
a different search engine opened each time the participant assessed a topic. None of the
participating students had previous knowledge of the search tasks that they assessed in
the different search engines. The participants were to conduct the searches as they
would normally do in everyday life. They had the option of following the links to satisfy
their information need, or of altering the search if they were not satisfied by the results.
When they found a result that answered the information problem, they copied the URL,
the position of the link in the ranked list, the search words they used, and a relevant
fragment from the web page that led to the decision that their information need was
satisfied. Observation notes were taken by the researcher during the usability
experiment. After the assessment of all topics by all participants, the researcher judged
the results for their correctness in order to investigate the ability of the searchers to find
relevant information.
In the second part of the experiment, the retrieval performance evaluation was
based on the first fifty hits retrieved by the search engines. The researcher optimized the
queries for each topic, and the participants were requested to provide relevance
judgments for these results using a 5-point scale. Each participant chose one topic of
his/her preference and provided relevance judgments for all fifty results retrieved for
that topic.
The experiment took place in the “Microlab” lecture theater of the Department of
Information Studies at the University of Sheffield from 15th July until 19th August 2005.
3.4.3 Test Environment
The “Microlab” lecture theater is equipped with desktop computers with internet
facilities. This was a prerequisite, so that the TERS toolkit could be loaded in the
browser for each session. There were ten sessions, with three students taking part in
each, so that the researcher was able to observe the participants.
3.4.4 Participant Sample
For the purposes of this study, thirty (30) participants were recruited. All of them
were students at the University of Sheffield: four (4) of them were research students and
the remaining twenty-six (26) were postgraduates. The academic backgrounds of the
postgraduate participants varied: three (3) information studies, five (5) computer
science, five (5) engineering, three (3) medicine, two (2) political studies, one (1)
human geography, four (4) business and management, one (1) law and two (2)
linguistics. The research students’ backgrounds were: one (1) architecture, one (1)
engineering, and two (2) medicine. All participants in the experimental study were
volunteers. In order to find out whether there are differences between the users’
performances, the students were divided into three categories, as suggested by Roelof
van Zwol (2004): beginners, novices and experts.
3.4.5 Information needs (Topics)
In the information retrieval field there is much debate about the nature of
the information needs used in the experimental design of various studies. According to
Borlund and Ingwersen (1997), when information needs are simulated, their dynamic
nature emerges during experimentation. TREC adopts an interesting approach to
generating topics. The TREC test collections include examples of indicative information
needs which are used when testing search engine performance. Each indicative
information need is referred to as a “topic”. The “topics” are written by experienced
users of real systems and represent real information needs. In this experiment, the
TREC test collections that refer to the World Wide Web were used, because they include
a variety of topics. The researcher preferred to avoid constructing the information needs,
so as to avoid any kind of bias. More specifically, the information needs, namely
“topics”, used in the experiment were chosen from three test collections provided by
TREC: the TREC 9 web track, TREC 6, and the TREC 8 web track ad hoc and small
web topics test collections (http://trec.nist.gov/data.html). The choice of “topics” from
these collections was based on the preferences of 10 students with academic
backgrounds different from those of the participants who volunteered to take part. The
list of “topics”, numbering thirty (30) plus an additional “topic” for practice, is
presented in Appendix B, together with the description, which provides specific
information on the information need, and the narrative, which gives further information
about what is regarded as relevant to each “topic’s” specific information need.
For the purposes of this study, in order to investigate any differences between the
topics, the researcher divided the “topics” into two categories based on Rouet’s (2002)
criteria. One category comprised topics of a generic nature, while the other included
“topics” that were more specific. The whole list with the two categories is provided in
Appendix B.
3.5 Experimental Procedure
3.5.1 Ethics Approval
Prior to implementing this experiment, ethical approval was obtained from the
Research Ethics Committee at the University of Sheffield. The researcher then
contacted the participants to explain the project, provided an information sheet with
further information about the experiment and asked for their consent to participate.
Consent forms were provided to each student. Once this process was completed, the
research design was implemented.
3.5.2 Pilot experiment
As in every experimental project, a pilot experiment was conducted. The purpose
of this preliminary run was to verify the feasibility of the experimental procedure by
testing all the features of the TERS platform with the use of a demo. A secondary
objective was to estimate the time required for every session, for both the usability and
the retrieval performance parts. The pilot experiment took place in the “Microlab”
lecture theater in the Department of Information Studies of the University of Sheffield,
and two potential participants, both university students, one expert and one novice, took
part. Five ‘topics’ were used for the retrieval performance part and one for the usability
part. After the pilot study, considering the time required for the completion of both
stages of the experiment, the researcher adjusted the number of ‘topics’ assessed by
each participant.
3.5.3 Configuration of TERS platform
The configuration of the TERS toolkit involved the following steps: general
specifications, retrieval performance experiment specifications and usability experiment
specifications. The general specifications include: retrieval system identification,
definition of the ‘topics’ (search assessments), and creation of user accounts. The
usability experiment specifications include: setting up the survey questions and
identifying the participants’ workload. The retrieval performance specifications
involve: setting up the runs for each ‘topic’ and identifying the workload for each
participant. In order to configure the TERS platform, the researcher constructed files for
all the aforementioned characteristics and loaded them onto the platform.
a) General Specifications
i) Retrieval system identification
In this section the researcher identified the three search engines that would be
subject to evaluation, by constructing a file and loading it onto the platform. The
contents of the file are illustrated in Appendix D.
ii) Definition of the ‘topics’
In this part the researcher defined the ‘topics’ and uploaded them to the platform.
All the topics were based on the TREC test collections. A sample part of the file that the
author constructed is available in Appendix D.
iii) Creation of user accounts
The thirty (30) participants who took part in the experiment were each given a
username and a password to access the experimental platform. The participants used the
same identification details (username and password) for both the usability and the
retrieval performance experiments. In accordance with the ethics arrangements for this
study, further information about the participants cannot be provided.
b) Usability experiment specifications
i) Survey questions set-up
In this section, the questions asked of the participants in the usability
experiment are defined. There are two types of questions: general questions and system
survey questions. The general questions are asked before the participant carries out the
usability experiment (Pre-session Questionnaire II) and concern the participant’s search
behavior. The system survey questions are asked at the end of the usability experiment
and investigate the participant’s satisfaction with respect to a particular retrieval system.
Three types of questions were used: 7-point Likert-scaled questions, Boolean questions,
and open questions. For the scaled and Boolean questions, a low and a high value are
specified, indicating how the scale should be interpreted. The open questions are
optional. A list of all survey questions is presented in Appendix D.
ii) Participant workload in the usability experiment
In this section the workload for each participant in the usability experiment is
defined. So that all ‘topics’ would be assessed in all search engines, the researcher
designed the workload by spreading the ‘topics’ equally among search engines and
participants. An example of how the workload is divided among search engines and
participants can be found in Appendix D.
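One simple way to spread ‘topics’ equally among search engines and participants is a rotation (Latin-square-style) assignment. The sketch below is an illustrative assumption, not necessarily the scheme used in the study:

```python
# Illustrative sketch: rotating topics over search engines so that, across
# participants, every topic is assessed in every engine equally often.
# The exact scheme used in the study is not documented; this is one option.

ENGINES = ["Google", "AltaVista", "Yahoo"]

def workload(participant, topics):
    """Assign each of the participant's topics to an engine by rotation."""
    return {t: ENGINES[(participant + i) % len(ENGINES)] for i, t in enumerate(topics)}

# Participant 0 and participant 1 see the same topics under shifted engines:
w0 = workload(0, ["topic1", "topic2", "topic3"])
w1 = workload(1, ["topic1", "topic2", "topic3"])
```

With this rotation, each participant uses all three engines, and over any three consecutive participants each topic is assessed once in each engine, which keeps engine and topic effects balanced.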
c) Retrieval Performance experiment specifications
i) Participant Workload in the retrieval performance experiment
Taking into account the pilot study conducted earlier and the excessive time
required for the retrieval performance experiment, the researcher assigned only one
‘topic’ to each participant. The list of the participants and the topics assigned to them is
provided in Appendix D.
ii) Run set-up
Due to time limitations, each topic was assessed in only one search engine.
Each run consisted of the top 50 results retrieved by a particular search engine. The
researcher gathered the first 50 results for each topic and uploaded them to the platform.
A typical example of one run is available in Appendix D. Before the experiment, all
links were checked.
3.5.4 Presentation of TERS, Pre-session Questionnaires, Practice
The experiment started with a 10-minute PowerPoint presentation, so that the
participants could become accustomed to the use of the TERS platform. During this
time, the volunteers had the chance to ask questions and to withdraw if they were not
interested. The remaining participants continued by filling in Pre-session Questionnaire I
(Appendix D), which gathered general information about the testers, and then logged
into the platform. Pre-session Questionnaire II collected more specific information
relating to the participants’ searching experience and skills. There was an example
‘topic’ for practice in the usability experiment before they started the actual searches.
3.5.5 Usability experiment section
Fifteen (15) ‘topics’ were assigned to each participant. The testers had no previous
knowledge of the ‘topics’. The usability experiment had to be conducted first, because
the topics used in both experiments were the same. Each participant assessed each
‘topic’ in one of the three (3) search engines already set by the researcher on the
platform. After understanding the indicative information need, the participant formed a
query in the search engine that opened each time via the platform. From the ranked list
that the search engine returned, the participant chose a link that satisfied the search
request. Having found a document appropriate to his/her information need, he/she
copied the following information to the platform:

• the URL of the link satisfying the need
• the search words used when forming the query in the search engine interface
• the position of the link satisfying the need in the ranked list that the search engine
provided
• a relevant fragment of the document that made him/her decide to choose this
specific document

After finishing one ‘topic’, the participant could move on to the next, although this
was not obligatory: the testers could postpone a difficult search and return to it later.
3.5.6 Post-session
After finishing the searches for all 15 ‘topics’, the participants had to judge each
system by filling in a post-session questionnaire (system survey). After completing
the usability session, testers could take a 15-minute break before continuing with the
retrieval performance experiment.
3.5.7 Judgment of Usability relevance assessments
Before the statistics of the usability experiment could be calculated, the
correctness of the answers found by the participants had to be determined. Thus, the
researcher checked the correctness of each of the fifteen (15) answers provided by the
participants for each ‘topic’. The examination was based on a two-point scale, where 0
meant that the answer given by the participant was not relevant and 1 that it was
relevant. The judge could see only the ‘topic’ together with the narrative giving further
information about what is relevant and what is not. Both the user and the system
information were coded.
3.5.8 Retrieval performance experiment section
In this part the user was asked to provide relevance judgments for one ‘topic’
(indicative information need). The relevance judgments were based on the ‘topic’
description and on a brief narrative, which provided further information about which
documents actually satisfy the information need and which do not. A list of the first
50 hits for each ‘topic’ was displayed on the platform, and the participant had to
judge all 50 hits for one ‘topic’ using a 5-point scale:
• highly relevant, when the whole document is relevant
• relevant, when only a fragment of the document is relevant
• irrelevant
• highly irrelevant, when the document is a complete miss; this category also
includes dead links
• not sure, when he or she was unable to judge the relevance
3.6 Result Analysis
3.6.1 Retrieval Performance measures
a) Precision and Recall measures
For measuring retrieval performance, the measures of recall and precision are
normally employed. Since the answer set for each information request is ranked,
however, a measurement cannot be based on these two figures alone: the retrieval
(ranking) strategy plays an important role in interactive environments such as search
engines. The curve of precision versus recall expresses the impact of the retrieval
strategy; it is calculated at 11 standard recall levels (0.0, 0.1, …, 1.0), at each of
which precision is measured. Alternatively, the average precision at a given cut-off
level can be measured, an approach that also gives information about the ranking
algorithm. Thus, if an information request has 40 relevant documents, precision is
measured after 40 documents, which avoids some of the averaging problems of
"precision at X documents". If a cut-off level is greater than the number of documents
retrieved for an information request, the non-retrieved documents are all assumed to
be non-relevant. These are the guidelines that the trec_eval tool (version 7.0) uses;
the TERS platform applies this tool for the measurement of recall and precision.
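The 11-point interpolation procedure described above can be sketched in a few lines of
Python. This is an illustrative reimplementation, not the actual trec_eval code, and
the function and variable names are the author's own:

```python
def eleven_point_interpolated_precision(ranked_relevance, total_relevant):
    """Interpolated precision at the 11 standard recall levels (0.0 to 1.0).

    ranked_relevance: booleans in rank order, True where the document is relevant.
    total_relevant: number of relevant documents for the information request.
    """
    # Compute (recall, precision) after each retrieved document.
    points = []
    relevant_seen = 0
    for rank, is_rel in enumerate(ranked_relevance, start=1):
        if is_rel:
            relevant_seen += 1
        points.append((relevant_seen / total_relevant, relevant_seen / rank))
    # Interpolated precision at recall r is the maximum precision observed
    # at any recall level greater than or equal to r.
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for r in levels:
        candidates = [p for rec, p in points if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated
```

Because each interpolated value is a maximum over all higher recall levels, the
resulting curves are monotonically non-increasing, which is the shape of the
recall-precision curves reported in chapter 4.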
3.6.2 Usability measures
a) Effectivity
Within TERS, two measures are used to establish whether the participants can find
relevant information (effectivity): correctness and positioning. Correctness is the
degree to which the answers to topics are correct and is calculated as: number of
correct answers / total number of answers. Its scale runs from 0 to 1. Positioning
specifies the ability to rank correct answers in a high position. For each answer the
position in the ranking is recorded; if a document's position is greater than 10, it
is assigned position 11. The formula used is: sum(11 − position of a correct answer)
/ total number of answers. Its scale runs from 0 to 10.
b) Efficiency
Efficiency measures the length of time that users need to successfully complete an
information problem (topic). It is calculated as: correctness / total amount of
seconds. Its scale runs from 0 to 1.
c) Satisfaction
The TERS platform measures user satisfaction by taking into account the participants'
answers to the survey questions (post-session questionnaire). An overall satisfaction
score is calculated by averaging the scores of the questions associated with a search
engine. The formula used for this measurement is: sum(scores) / total number of
related questions. The measure has a scale from 1 to 7.
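The four usability formulas above can be expressed directly in code. The following is
an illustrative Python sketch of the formulas as stated, not the TERS platform's own
implementation:

```python
def correctness(answers):
    """answers: list of 0/1 judgments of the participants' answers (1 = relevant)."""
    return sum(answers) / len(answers)          # scale 0 to 1

def positioning(correct_positions, total_answers):
    """correct_positions: ranked-list positions of the correct answers.
    Positions beyond 10 are capped at 11, so they contribute nothing."""
    capped = [min(p, 11) for p in correct_positions]
    return sum(11 - p for p in capped) / total_answers   # scale 0 to 10

def efficiency(correctness_score, total_seconds):
    """Correct answers per second of searching."""
    return correctness_score / total_seconds

def satisfaction(scores):
    """scores: the 1-7 survey answers associated with one search engine."""
    return sum(scores) / len(scores)            # scale 1 to 7
```

For example, a participant with 12 correct answers out of 15 has a correctness of
0.80, and a correct answer found at rank 1 contributes the full 10 points to the
positioning sum.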
4. Results and analysis
4.1 Introduction
All thirty (30) participants took part in both parts of the experiment, usability and
retrieval performance, and returned the questionnaires completed. In this chapter the
analysis of the gathered data begins with a presentation of the participants'
profile; the results of the retrieval performance and usability experiments are then
presented and analyzed.
4.2 Participants’ profile
Of the thirty participants, sixteen are aged between twenty-one and twenty-five.
Nineteen are male and eleven female. Twenty-six are postgraduate students and four
are research students. The participants' academic backgrounds vary widely, covering
information studies, computer science, engineering, medicine, political studies,
human geography, business and management, law and linguistics.
Table 1
                                                                             avg (mean)   std
1. How many years have you been using search engines?                           5.10      1.24
2. What kind of search engine user are you? (beginner – expert)                 4.86      1.59
3. How often do you use the Internet to find information? (monthly – daily)     6.66      0.60
4. Does your search for information in that case always lead to satisfying
   results? (always – never)                                                    5.53      0.93
5. How often do you use the advanced options when you search using the web
   search engines? (never – always)                                             3.93      2.08
6. I know what "Boolean search" is (strongly disagree – strongly agree)         3.36      2.68
7. I use "Boolean search" (strongly disagree – strongly agree)                  2.63      2.22
8. I know what a "ranked list" is (strongly disagree – strongly agree)          4.10      2.59
9. The "advanced" search mode in the web search engines is an effective tool
   (strongly disagree – strongly agree)                                         5.10      1.42
All tables of analytical results are available in Appendix E
Table 1 shows the following participant characteristics: a) most of them started
using search engines five to six years ago; b) they regard themselves as neither
beginners nor experts; c) they use the internet quite often to find information;
d) their searches for information usually lead to satisfying results; e) although
they accept that the "advanced" search mode is an effective tool, they do not seem to
use it; and f) although some of them know how to make a search more effective, they
do not seem to do so.
Taking into account the answers given by the participants, it seems that most of them
have been using search engines for five to six years. Comparing their ages with their
years of web search engine use, it can be inferred that a great number of them
started using search engines during their studies at university, as Graph 1 below
illustrates.
Graph 1: Number of participants by years of search engine use (1, 1-2, 3-4, 5-6, 7-8,
9-10, over 10), grouped by age (21-25, 26-30, over 35)
Moreover, regarding search engine preference, the result was striking: all thirty
(30) participants named Google as their preferred search engine, while the other
search engines, AltaVista, Yahoo, Alltheweb, EntireWeb and Search.com, did not
interest the sample at all.
Graph 2: Number of participants (0-30) preferring each search engine (Google,
AltaVista, Yahoo, Alltheweb, EntireWeb, Search.com)
Table 2: How the participants alter their search when not satisfied with the results

                                                              Answers   Percent
1. do nothing / give up                                           0        0
2. ask somebody for help                                          0        0
3. use the “help” option that the web search engine provides      1        3.3
4. change the web search engine                                   2        6.7
5. use the “advanced” search mode                                 5       16.7
6. change the query slightly                                     22       73.3
7. change the query completely                                    0        0
TOTAL                                                            30      100
Taking into account the answers given by the participants, a large majority of them
(73.3%) prefer to change the query slightly as a first option, while 16.7% switch
from simple to advanced search mode. Only 6.7% change the web search engine, and 3.3%
use the “help” option.
4.3 Retrieval performance experiment results and analysis
4.3.1 Recall - Precision results

Table 3
                     Google   AltaVista   Yahoo
Retrieved              500       500        499
Relevant               242       170        258
Relevant Retrieved     242       170        260
Recall                   1         1          1
Precision             0.4861    0.4622     0.5266
From the table it can be concluded that Yahoo had the best precision, with Google following and AltaVista coming last.
a) Interpolated Recall-Precision

Table 4: Interpolated Recall-Precision Averages
Recall   Google   AltaVista   Yahoo
0.0      0.6737    0.9117     0.8533
0.1      0.5917    0.8442     0.7217
0.2      0.5877    0.6030     0.6636
0.3      0.5669    0.5411     0.6143
0.4      0.5251    0.4980     0.5934
0.5      0.5251    0.4884     0.5779
0.6      0.5185    0.4800     0.5762
0.7      0.5100    0.4800     0.5688
0.8      0.5022    0.4800     0.5627
0.9      0.4896    0.4662     0.5330
1.0      0.4861    0.4627     0.5266
In order to compare the retrieval performance of the three search engines' algorithms over all test queries, the average precision at each recall level was used. According to Yates (1999:77), "since recall levels for each query might be distinct from the 11 standard recall levels, utilization of an interpolation procedure is often necessary". The curve of precision versus recall resulting from the data in the table above is illustrated below.
Graph 3: Recall-Precision curves for Google, AltaVista and Yahoo at the 11 standard
recall levels
Moreover, the average precision at different cut-off levels is shown below:

Table 5
Documents   Google    AltaVista   Yahoo
1           0.3000     0.8000     0.7000
2           0.3000     0.6000     0.6000
3           0.3000     0.4667     0.6000
4           0.3500     0.4500     0.6000
5           0.3600     0.4400     0.6000
6           0.4167     0.4000     0.5667
7           0.4286     0.4000     0.5429
8           0.4125     0.4000     0.5375
9           0.4000     0.3667     0.5333
10          0.4000     0.3600     0.5300
15          0.4200     0.3400     0.5267
20          0.4350     0.3400     0.5150
Graph 4: Precision at document cutoff levels (1-20) for Google, AltaVista and Yahoo
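The cut-off figures above correspond to precision after the first k retrieved
documents, averaged over the test queries. A minimal Python sketch (illustrative
names, not the trec_eval implementation):

```python
def precision_at_k(ranked_relevance, k):
    """Precision after the first k retrieved documents.

    ranked_relevance: booleans in rank order, True where the document is relevant.
    """
    top = ranked_relevance[:k]
    return sum(top) / k

def average_precision_at_k(per_query_relevance, k):
    """Mean of precision@k over a set of queries, as reported per cut-off level."""
    return sum(precision_at_k(r, k) for r in per_query_relevance) / len(per_query_relevance)
```

Unlike the interpolated recall-precision averages, this measure ignores how many
relevant documents exist in total, which is why it characterizes the quality of the
top of the ranking rather than overall recall.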
4.4 Usability experiment results
4.4.1 Overall analysis
Table 6 (the second row refers to the correctly answered topics only)

         Topics  Time total  avg  stddev  min  max   Corr.  Pos. (std)   Eff. (corr./pos.)  Satisf. (std)
All      450     120609      268  281     41   3154  0.78   7.83 (2.64)  0.0029/0.0286      5.01 (1.55)
Correct  354      96941      273  303     41   3154
Overall, of the four hundred and fifty (450) topics assessed by the participants, the
researcher inspected all of them and found that three hundred and fifty-four (354)
were answered correctly. This means that 78.66% of the answers given were correct;
presumably, the participants were competent enough to search for and retrieve
relevant documents. In total, the participants needed about thirty-three and a half
(33.5) hours to search for these topics on the web search engines, an average of
about four and a half minutes (268 seconds) per topic.
In order to measure the first element of usability, the researcher took into account
correctness and positioning, as described in section 3.6 on result analysis. Average
correctness was found to be 0.78; given that the scale of correctness runs from 0 to
1, this result is adequately high. Moreover, the average positioning was found to be
7.83; as the scale for this measure runs from 0 to 10, the positioning level is also
high. Presumably, effectivity, one of the measures used to estimate usability,
answers positively the question of whether users are able to find relevant
information.
Additionally, the other important element of usability is efficiency, which refers to
the length of time users require to successfully complete a topic. With the correctly
answered topics already known, efficiency works out at 0.0029. On a scale from 0 to 1
this score looks low in absolute terms, but since the measure divides correctness by
the total number of seconds, small values are to be expected given that topics took
minutes rather than seconds to complete.
Finally, the third component of usability, satisfaction, was found on average to be
very high: 5.01 on a scale from 1 to 7. Presumably, taking all the above calculations
into account, the overall usability of these search engines, that is, the quality of
the interaction between the user and the systems, is positive.
4.4.2 Search engines usability testing results
Table 7 (in each slash pair the first figure refers to all answers and the second to
correct answers only)

id  Name       Topics   Time total    avg      stddev   min    max        Corr.  Pos. (std)   Eff. (corr./pos.)  Satisf. (std)
1   Google     150/115  35651/27001   237/234  187/183  50/50  1081/1081  0.76   8.37 (2.47)  0.0032/0.0356      5 (1)
2   AltaVista  150/119  43728/36065   291/303  279/301  41/41  1985/1985  0.79   7.01 (2.70)  0.0027/0.0231      4 (1)
3   Yahoo      150/120  41230/33875   274/282  351/385  46/46  3154/3154  0.80   8.12 (2.56)  0.0029/0.0287      5 (1)
With reference to this study, the researcher compared the three search engines on the
usability measures to estimate their performance.

Of the hundred and fifty (150) topics assessed on each search engine, the researcher
found that a hundred and fifteen (115) were correctly answered by the participants
when using Google, a hundred and nineteen (119) when using AltaVista and a hundred
and twenty (120) when using Yahoo. On average, users spent 237 seconds (just under
four minutes) finding an answer to a topic on Google, 291 seconds on AltaVista and
274 seconds on Yahoo. Thus, when searching on Google the participants spent
relatively less time than on the other search engines, but their answers were
slightly less often correct. The differences found, though, are not significant, as
correctness is 0.76 for Google, 0.79 for AltaVista and 0.80 for Yahoo. Moreover, when
measuring positioning, the researcher found that Google ranked the correct documents
higher than the other systems; Yahoo follows with an average positioning of 8.12, and
AltaVista comes last with a noticeably lower positioning of only 7.01.
As far as efficiency is concerned, Google scored highest (0.0032), followed by Yahoo
(0.0029) and AltaVista (0.0027): relative to the number of correct answers given,
participants needed the least time when using Google. In terms of satisfaction,
Google and Yahoo rate the same on the basis of the answers to the survey questions,
while AltaVista follows with a score of 4 out of 7.
Overall, participants were 35.71% satisfied by Google, 37.71% by Yahoo and 28.58% by
AltaVista. They spent on average the least time searching for a topic on Google, more
on Yahoo and the most on AltaVista. They provided the fewest correct answers when
using Google, more when using AltaVista and the most when using Yahoo. The average
time per correctly completed task was lowest on Google, higher on Yahoo and highest
of all on AltaVista. The correct answers given, though, were positioned higher in
Google's ranked list, then in Yahoo's and finally in AltaVista's.
Summary Statistics
Table 8
Google AltaVista Yahoo
Effectivity
Correctness 0.76 0.79 0.80
Positioning 8.37 7.01 8.12
Efficiency 0.0032 0.0027 0.0029
Satisfaction 5 4 5
Looking at the table above, Google performed better in terms of effectivity (positioning), efficiency and satisfaction, with Yahoo following and AltaVista coming last. In terms of usability, then, Google rates highest, facilitating the interaction between the user and the system via its interface.
4.4.3 Topics analysis results

Table 9 (in each slash pair the first figure refers to all answers and the second to
correct answers only)

id  Title                          Topics  Time        avg      stddev   min      max        Corr.  Pos. (std)   Eff. (corr./pos.)
1   Bengals cat                    15/15   3305/3305   220/220  133/133  100/100  564/564    1.00   8.00 (2.44)  0.0045/0.0363
2   Chevrolet Trucks               15/8    4575/1843   305/230  242/168  111/111  1049/625   0.53   8.37 (2.72)  0.0017/0.0363
3   Fasting                        15/12   2161/1667   144/138  54/58    67/67    285/285    0.80   8.66 (1.87)  0.0055/0.0623
4   Lava lamps                     15/13   2867/2573   191/197  81/83    81/81    334/334    0.86   7.38 (2.66)  0.0045/0.0373
5   Tartin                         15/13   4230/3975   282/305  254/266  89/89    1043/1043  0.86   7.30 (3.01)  0.0030/0.0238
6   Deer                           15/9    4112/2418   274/268  220/217  67/67    741/741    0.60   9.00 (1.11)  0.0021/0.0334
7   Incandescent light bulb        15/14   3077/2321   205/165  167/72   53/53    756/289    0.93   9.00 (1.79)  0.0045/0.0542
8   Mexican food culture           15/11   3400/1803   226/163  164/92   46/46    644/276    0.73   8.90 (1.30)  0.0032/0.0543
9   Jennifer Aniston               15/14   3175/3043   211/217  146/150  68/68    586/586    0.93   8.57 (1.39)  0.0044/0.0394
10  Pine tree                      15/8    3818/1848   254/231  130/129  77/77    465/461    0.53   8.25 (2.54)  0.0020/0.0357
11  Auto skoda                     15/10   7984/6802   532/680  796/950  89/89    3154/3154  0.66   8.80 (2.04)  0.0012/0.0129
12  Nirvana                        15/15   3751/3751   250/250  179/179  87/87    683/683    1.00   7.40 (2.94)  0.0039/0.0295
13  Decade of the 1920's           15/15   2806/2806   187/187  112/112  75/75    398/398    1.00   6.93 (3.21)  0.0053/0.0370
14  DNA Testing                    15/7    6426/4218   428/602  497/688  93/93    1985/1985  0.46   6.14 (2.85)  0.0010/0.0101
15  Behavioral Genetics            15/14   3280/3205   218/228  124/122  70/70    484/484    0.93   7.78 (2.19)  0.0042/0.0340
16  Cosmic Events                  15/11   7472/6601   498/600  555/622  119/120  2132/2132  0.73   6.63 (3.10)  0.0014/0.0110
17  Tropical Storms                15/11   3332/2683   222/243  126/131  76/105   551/551    0.73   8.63 (1.74)  0.0033/0.0354
18  Carbon Monoxide Poisoning      15/11   6153/5313   410/483  314/334  64/64    1124/1124  0.73   8.45 (1.91)  0.0017/0.0175
19  UV damage, eyes                15/14   2777/2715   185/193  101/99   62/74    393/393    0.93   8.00 (2.68)  0.0050/0.0412
20  Greek, philosophy, stoicism    15/8    6657/4630   443/578  483/629  67/67    1808/1808  0.53   5.50 (4.10)  0.0012/0.0095
21  Antibiotics Ineffectiveness    15/12   3178/2575   211/214  148/132  53/60    531/531    0.80   8.75 (1.91)  0.0037/0.0407
22  Drugs in Golden Triangle       15/12   3636/3267   242/272  178/187  81/81    660/660    0.80   8.41 (1.88)  0.0033/0.0309
23  Legionnaires' disease          15/15   1652/1652   110/110  45/45    49/49    192/192    1.00   9.06 (1.33)  0.0090/0.0823
24  Killer Bee Attacks             15/13   2799/2629   186/202  129/131  53/54    460/460    0.86   8.30 (2.62)  0.0046/0.0410
25  Radio Waves and Brain Cancer   15/14   3144/3057   209/218  102/100  56/56    415/415    0.93   6.07 (3.14)  0.0044/0.0278
26  Undersea Fiber Optic Cable     15/9    4212/2996   280/332  182/217  88/88    692/692    0.60   6.11 (3.65)  0.0021/0.0183
27  Risk of Aspirin                15/9    3590/2491   239/276  197/235  41/41    680/680    0.60   5.88 (4.28)  0.0025/0.0212
28  Metabolism                     15/11   3903/2282   260/207  194/173  47/47    614/614    0.73   8.00 (2.82)  0.0028/0.0385
29  Health and Computer Terminals  15/13   4252/3871   283/297  228/243  78/78    822/822    0.86   7.23 (3.16)  0.0030/0.0242
30  New Hydroelectric Projects     15/13   4885/4601   325/353  273/283  66/66    1081/1081  0.86   7.92 (2.25)  0.0026/0.0223
Based on the evidence of the table above, the topics that the participants found
rather difficult are: 2. “Chevrolet trucks”, 6. “Deer”, 10. “Pine tree”, 14. “DNA
testing”, 20. “Greek philosophy, stoicism”, 26. “Undersea Fiber Optic Cable” and
27. “Risk of aspirin”. Taking into account the list the researcher drew up of generic
and specific topics (Appendix B), it is understandable that topics 2, 6, 10, 14 and
20, which are generic, could be interpreted in many ways; their nature thus explains
their scores. On the contrary, topics such as 26 and 27, which are specific,
performed surprisingly badly.

The easiest topic according to the data is topic 23, which received 15 correct
answers and on which the participants spent the least time. In terms of correctness,
topics 1, 12, 13 and 23 had the best scores, while topics 2, 10, 14 and 20 had the
worst.

Moreover, the topics whose correct answers were ranked highest in the results lists
are topics 23, 6 and 7, with an average positioning of about 9 out of 10. On the
other hand, the correct answers for topics 14, 20, 25, 26 and 27 were found lower in
the ranked lists of results. Additionally, a high efficiency score was attained for
topics 7, 8 and 19.
4.4.4 User results analysis

Table 10 (in each slash pair the first figure refers to all answers and the second to
correct answers only)

User Topics  Time        avg      stddev   min      max        Corr.  Pos. (std)   Eff. (corr./pos.)  Satisf. (std)
1    15/12   1702/1380   113/115  38/42    70/70    215/215    0.80   8.41 (2.10)  0.0070/0.0731      5.12 (1.04)
2    15/11   2412/1610   160/146  100/80   68/68    416/265    0.73   6.63 (2.97)  0.0045/0.0453      5.10 (2.16)
3    15/12   5582/4109   372/342  258/194  67/67    1049/683   0.80   7.08 (2.96)  0.0021/0.0206      6.41 (0.91)
4    15/13   7216/6439   481/495  411/435  134/134  1808/1808  0.86   6.76 (3.72)  0.0018/0.0136      5.18 (1.55)
5    15/9    2578/1288   171/143  82/48    81/81    422/224    0.60   8.88 (1.61)  0.0034/0.0621      4.95 (1.23)
6    15/12   3034/2492   202/207  101/110  78/78    425/425    0.80   9.16 (0.93)  0.0039/0.0441      5.93 (0.69)
7    15/14   5043/4365   336/311  337/336  106/106  1419/1419  0.93   6.50 (2.82)  0.0027/0.0208      4.85 (1.33)
8    15/11   2296/1651   153/150  78/85    69/80    364/364    0.73   8.27 (2.28)  0.0047/0.0551      5.58 (1.86)
9    15/10   2987/2001   199/200  81/90    89/89    384/384    0.66   7.60 (2.79)  0.0033/0.0379      5.60 (1.19)
10   15/14   6902/6370   460/455  256/265  138/138  1124/1124  0.93   6.28 (2.70)  0.0020/0.0138      5.18 (1.87)
11   15/11   2666/1983   177/180  134/156  60/60    521/521    0.73   7.09 (3.36)  0.0041/0.0393      2.62 (0.95)
12   15/11   2933/1864   195/169  114/61   76/76    480/254    0.73   7.45 (3.32)  0.0037/0.0439      5.02 (1.61)
13   15/13   2800/2533   186/194  133/141  60/60    564/564    0.86   8.69 (1.60)  0.0046/0.0446      4.58 (1.48)
14   15/14   8341/8029   556/573  789/816  53/53    3154/3154  0.93   8.85 (2.17)  0.0016/0.0154      4.79 (1.45)
15   15/10   3594/2353   239/235  105/113  115/115  437/437    0.66   6.70 (2.75)  0.0027/0.0284      2.93 (1.22)
16   15/11   2096/1624   139/147  113/128  53/54    498/498    0.73   8.90 (1.44)  0.0052/0.0603      5.75 (1.08)
17   15/10   3885/2321   259/232  124/121  90/90    476/448    0.66   8.40 (2.36)  0.0025/0.0361      5.66 (0.63)
18   15/13   4878/4320   325/332  119/126  139/139  520/520    0.86   6.53 (3.38)  0.0026/0.0196      4.60 (1.59)
19   15/12   3571/2805   238/233  139/129  102/102  566/566    0.80   8.58 (2.57)  0.0033/0.0367      4.54 (1.45)
20   15/12   4255/2755   283/229  150/97   116/116  642/414    0.80   9.25 (1.28)  0.0028/0.0402      5.93 (0.31)
21   15/12   2282/1783   152/148  45/44    77/77    243/243    0.80   8.08 (2.27)  0.0052/0.0544      4.95 (0.65)
22   15/13   4034/3791   268/291  213/221  101/106  822/822    0.86   7.23 (3.21)  0.0032/0.0247      4.43 (0.84)
23   15/14   8078/7909   538/564  438/442  60/60    1401/1401  0.93   8.07 (2.86)  0.0017/0.0142      5.56 (1.33)
24   15/7    1166/468    77/66    30/26    41/41    155/120    0.46   7.71 (1.79)  0.0060/0.1153      3.89 (1.58)
25   15/11   5539/4274   369/388  460/538  96/96    1985/1985  0.73   8.00 (2.32)  0.0019/0.0205      5.22 (2.04)
26   15/12   4481/2946   298/245  199/162  46/46    756/625    0.80   8.66 (1.96)  0.0026/0.0353      5.27 (1.60)
27   15/12   4082/2984   272/248  177/156  74/74    644/539    0.80   8.66 (1.72)  0.0029/0.0348      4.62 (1.14)
28   15/14   3196/3120   213/222  151/152  49/49    547/547    0.93   7.78 (2.86)  0.0043/0.0349      5.10 (1.32)
29   15/10   4668/3179   311/317  192/214  145/151  865/865    0.66   7.60 (3.77)  0.0021/0.0239      5.00 (1.14)
30   15/14   4312/4195   287/299  519/536  59/59    2132/2132  0.93   7.57 (2.56)  0.0032/0.0252      6.06 (0.80)
In general terms, all users succeeded in assessing all 15 of their topics across the
different search engines. None of them found correct answers for every topic
searched; 6 out of 30 participants managed to find 14 out of 15 correct answers, and
only two participants had fewer than 10 correct answers. Participant 23 spent the
most time conducting the searches but succeeded in finding 14 out of 15 correct
answers, spending on average about 9 minutes (538 seconds) per topic. Participant 24,
on the other hand, as extracted from the data, spent the least time (about 77 seconds
per topic) and had a rather poor score for correct answers. Moreover, it is
interesting to note that most participants found their answers high in the ranked
list of retrieved records.
Satisfaction

Of all the measurements defined to measure search engine usability, the most
important is satisfaction. The table below collects the results given by the users
when judging the three (3) web search engines.
Table 11
                                      Google       AltaVista    Yahoo
                                      avg (mean)   avg (mean)   avg (mean)
1. Search options                     5.40         4.29         5.02
2. Presentation of results            4.87         3.92         4.56
3. Relevance of retrieved documents   4.93         4.03         4.66
4. Response time                      6.33         5.30         5.40
5. Satisfaction overall               5.86         4.12         5.01
Total                                 5.48         4.33         4.93
As extracted from the table, in terms of satisfaction Google performed better in all
areas of usability. Yahoo came second with an average of 4.93 and AltaVista third
with an average of 4.33. (The data from the participants' answers are provided in
Appendix E.)
5. Limitations of the experimental study
5.1 Search engines
The description of the search engines compared in this experiment may be
insufficient. Only data gathered from the search engines' web sites were included,
and no information is provided about the search engines' ranking algorithms. When
comparing these kinds of information retrieval systems on their ability to retrieve
relevant documents and to rank more relevant documents higher in the list of
retrieved results, it is important for the researcher to acquire more specific
information. It is understandable that this kind of information is regarded as a
business secret; on the other hand, it is essential to realize that only through
evaluation can the design of these systems be improved.
5.2 Participants
When comparing search engines, especially their interfaces, it is important to have
an adequate sample. The initial target was 60 participants, but it was not easy to
meet. By the time the experiment took place, after ethical approval from the
university authorities had been obtained, most of the university's students were not
available to take part. Moreover, the fact that the participants were volunteers
raises concerns about the care with which they conducted the relevance evaluations
during the retrieval performance experiment.
5.3 Test collection
Due to time limitations, it was not possible for the researcher to construct a test
collection from scratch. The author therefore thought it adequate to build a list
based on the TREC test collections that are used for such purposes. Notwithstanding
the considerable debate concerning TREC test collections and their use for comparing
interactive information retrieval systems, the researcher selected 30 topics from
three (3) test collections based on students' preferences. These students did not
take part in the experiment, since having no previous knowledge of the topics was a
prerequisite for the usability testing. Additionally, the construction of the test
collection list can itself be questioned. Moreover, the assumption that the
participants had no previous knowledge cannot be guaranteed, as the researcher gave
them the opportunity to choose topics of their preference from the list.
5.4 Observation data

Although observational data were gathered from all the participants, they proved
difficult to analyze because of the time limitations.
5.5 Researcher’s interference
Taking into account the literature concerning researcher interference when similar
experiments have been conducted, the influence this can have on the participants
cannot be disregarded. The results gathered when participants are supervised differ
from those gathered when participants are merely observed.
6. Discussion
Even though the results were presented and analyzed extensively in section four (4),
satisfying the major objectives of this study, one more interesting point must be
taken into consideration: the correlation between the two experiments.

As the data analysis shows, in the retrieval performance experiment Yahoo performed
better than the other two systems. This means that Yahoo's retrieval algorithm is
more efficient than Google's and AltaVista's: Yahoo can retrieve more relevant
documents, and it can rank more relevant documents higher in the list of retrieved
results. The differences between Yahoo and Google, though, are small, and the results
of this experiment can be questioned, as discussed in the previous section
(5. Limitations of the experimental study). On the other hand, the results of the
usability testing showed that Google's interface comes much closer to the users'
information needs; specifically, Google satisfied more of the criteria used for the
measurement of usability than the other search engines. Presumably, there is a
negative correlation between usability and retrieval performance.
The findings of this experiment are in line with Turpin and Hersh's (2000)
investigations of the correlation between the two kinds of experiment and with the
results of a case study conducted by Dr. Roelof van Zwol (2004). In contrast to this
study, Turpin and Hersh (2000) also tried to explain the reasons for such differences
between ‘batch’ evaluations and user-oriented experiments. One of their most
important explanations for this conflict was that “users do not issue queries that
take advantage of the increased performance offered by the batch systems…as users of
the improved system issue queries that result in more relevant documents being ranked
higher in the output” (Turpin and Hersh, 2000:230).
7. Conclusion
The fundamental reason for the development of web search engines is the digitization
of information resources. In order for search engines to justify their existence,
usage and efficiency, extensive evaluation of these information retrieval systems has
become essential.
In earlier days the importance of such evaluation lay in the ability to develop
better retrieval algorithms so as to retrieve relevant documents. Researchers thus
focused more on the systems' mechanisms and tried to analyze them with a view to
improving them. For this purpose they needed a laboratory environment, a test
collection of documents, a set of queries and a set of relevant documents;
by formulating queries, the researchers tested the algorithms' retrieval performance.
This approach is representative of the Cranfield paradigm established by Cleverdon et
al. in the 1950's. Though this perspective was adequate for the evaluation of
information retrieval systems at the time, it cannot be applied on its own in today's
highly interactive environments, such as the World Wide Web. Presumably, new
approaches had to be developed for the evaluation of modern web search engines.
Therefore, the focus turned to a more user-centered approach. This approach's major
objective is, according to Borlund (2000), the investigation of the interaction
between the user, the retrieval mechanism and the database when extracting
information under real-life conditions. The supporters of this methodology severely
criticized the criteria used by the traditional system-centered approach and
highlighted their inadequacy for evaluating interactive information retrieval systems
such as search engines.
Although this approach is more modern and undermines the narrow system-centered
logic, it is not easy to conduct. Jones and Willett, quoted by Voorhees (2002:355),
mention that “a properly designed user-based evaluation must use a sufficiently
large, representative sample of actual users of the retrieval system; each of the
systems to be compared must be equally well developed and complete with an
appropriate user interface; each subject must be equally well trained on all systems
and care must be taken to control for the learning effect”.
By the same token, another perspective has to be adopted, one that emphasizes the
need for experimentation in a laboratory environment with real-life situations. As a
result, the horizons are broadening in order to satisfy this need.
Roelof van Zwol (2004) supported this idea and devised an evaluation platform that
combines the user-centered and system-oriented approaches. For the usability testing
he used the measures of “effectivity”, “efficiency” and “satisfaction”; for the
retrieval performance evaluation he remained focused on the traditional measures of
recall and precision, changing them slightly.

The researcher agrees with the idea of bringing realism into a laboratory environment
and has therefore used the platform built by Roelof van Zwol (2004) to conduct this
study comparing the usability and retrieval performance of Google, AltaVista and
Yahoo.
The results of the first part of the experiment, the usability testing, revealed the
superiority of Google over the other search engines. Google performed better in the
“positioning” measure (a sub-measure of “effectivity”, which specifies the ability of
the search engine to rank the correct answers given by the participants in a high
position), scoring 8.37 out of 10, whereas the scores for AltaVista and Yahoo were
7.01 and 8.12 respectively. For the other sub-measure of “effectivity”, “correctness”
(the degree to which the answers to the topics are correct), Yahoo scored 0.80 out of
1, whereas Google gained 0.76 and AltaVista 0.79. Moreover, in terms of “efficiency”
(the ability to successfully complete a topic relative to the time needed) and
“satisfaction” (the users' satisfaction, measured with the survey questionnaire),
Google performed better in both, with scores of 0.0032 out of 1 and 5 out of 7
respectively; for the same measures, AltaVista had means of 0.0027 and 4, and Yahoo
0.0029 and 5. Presumably, out of the four criteria (including sub-criteria), Google
scored highest on three. Consequently, in the usability evaluation Google is ranked
first, Yahoo second and AltaVista third.
On the other hand, when retrieval performance is compared, Yahoo achieved the
highest precision (0.5266), a rather good performance, followed by Google (0.4861) and
AltaVista (0.4622). The conflict between usability and retrieval performance is thus
confirmed, and the findings are in line with the results of Turpin and Hersh (2001) and
Roelof van Zwol (2004).

As for future work and further development, similar experiments with a larger
sample of participants are recommended, in order to investigate the reasons for the
conflict between the usability and retrieval performance results.
References
• Beaulieu, M., Robertson, S. and Rasmussen, E. (1996). “Evaluating interactive
systems in TREC”, Journal of the American Society for Information Science,
47(1), 85-94.
• Borlund, P. (2000) “Experimental components for the evaluation of IIR
systems”. Journal of Documentation. 56 (1). 71 –90.
• Chu, H. and Rosenthal, M. (1996). “Search engines for the World Wide Web: a
comparative study and evaluation methodology”, Proceedings of the ASIS
Annual Meeting, 33, 127-135.
• Courtois, M.P., Baer, W. and Stark, M. (1995). “Cool tools for searching the
Web: a performance evaluation”, Online, 19(6), 14-32.
• Ding, W. and Marchionini, G. (1996). “A comparative study of Web search
service performance”, Proceedings of the ASIS Annual Meeting, 33, 136-142.
• Dong, X. and Su, L.T. (1997). “Search engines on the World Wide Web and
information retrieval from the Internet: a review and evaluation”, Online &
CDROM Review, 21(2), 67-68.
• Frokjaer, E., Hertzum, M. & Hornbeak, K. (2000). “Measuring Usability: Are
effectiveness, efficiency, and satisfaction really correlated?”. In: SIGCHI 2000
Hague [Online]. Proceedings of the SIGCHI conference on Human factors in
computing systems. 01 – 06 April 2000, Hague, The Netherlands. New York:
ACM Press. http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May
2005].
• Gauch, S. and Guijun Wang (1996). “Information Fusion with Profusion”,
Webnet 96 Conference, San Francisco, CA, October 15-19.
• Gordon, M. and Pathak, P. (1999). “Finding information on the World Wide
Web: The retrieval effectiveness of search engines”, Information Processing and
Management, 35(2), 141-180.
• Griesbaum, J. (2004). “Evaluation of three German search engines: Altavista.de,
Google.de and Lycos.de”. Information Research, 9(4), paper 189. [Available at
http://InformationR.net/ir/9-4/paper189.html]
• Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001). “Measuring
search engine quality”, Information Retrieval, 4, 33-59.
• Hersh, W. & Turpin, A. (2000). “Do Batch and User Evaluations give the same
results?”. In: SIGIR 2000 Athens [Online]. Proceedings of the 23rd annual
international ACM SIGIR conference on Research and development in
information retrieval. 24 – 28 July, 2000, Athens, Greece. New York: ACM
Press. http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May 2005].
• http://help.yahoo.com/help/us/ysearch/
• http://trec.nist.gov/data.html
• http://www.altavista.com/help/
• http://www.google.com/help/basics.html
• Jones, S. & Willett, P. (1997). “Readings in information retrieval”. San
Francisco: Morgan Kaufmann Publishers.
• Large, A., Tedd, L. & Hurtley, R. (2001). “Information seeking in the online
age: principles and practice”. München : K. G. Saur.
• Leighton, H. V. and Srivastava, J. (1997). “Precision among World Wide Web
Search Services (Search Engines): Alta Vista, Excite, HotBot, InfoSeek, Lycos”.
cybermetrics.cindoc.csic.es [Site visited 18/04/02].
• Leighton, H. V. and Srivastava, J. (1999). “First 20 Precision among World
Wide Web Search Services (Search Engines)” Journal of the American Society
for Information Science, 50 (10), 870-881
• Mullins, C. (2005). “Transaction Log Guidelines”. dbazine [Online] 25 April.
http://www.dbazine.com/sql/sql-articles/mullins-sqlserver [Accessed 5
September 2005].
• Over, P. (2001). “The TREC interactive track: an annotated bibliography”.
Information Processing and Management. 37, 369-381.
• Pors, N. (2000). “Information retrieval, experimental models and statistical
analysis”, Journal of Documentation, 56(1), 55-70.
• Rouet, J.F. (2003). “What was I looking for? The influence of task specificity
and prior knowledge on students search strategies in hypertext”. Interacting with
Computers, 15 (3), 409-428.
• Saracevic, T. (1995). “Evaluation of Evaluation in information retrieval”. In:
SIGIR 1995 Seattle [Online]. Proceedings of the 18th annual international ACM
SIGIR conference on Research and development in information retrieval. 9-13
July 1995, Seattle, USA. New York: ACM Press.
http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May 2005].
• Schwartz, C. (1998). “Web search engines”. Journal of the American Society for
Information Science, 49(11), 973-982.
• Spink, A. (2002). “A user-centered approach to evaluating human interaction
with Web search engines: an exploratory study”. Information Processing and
Management, 38(3), 401-424.
• Su, L. (1992). “Evaluation measures for interactive information retrieval”.
Information Processing and Management, 28(4), 503-516.
• Su, L. (1994). “The Relevance of Recall and Precision in User Evaluation”.
Journal of the American Society for Information Science, 45(3), 207-217.
• Su, L. (1998). “Value of search results as a whole as the best single measure of
information retrieval performance”. Information Processing and Management,
34(5), 557-579.
• Turpin, A. & Hersh, W. (2001). “Why Batch and User Evaluations do not give
the same results”. In: SIGIR 2001 New Orleans. [Online]. Proceedings of the
24th annual international ACM SIGIR conference on Research and development
in information retrieval. 9-12 September 2001, New Orleans, USA. New York:
ACM Press. http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May
2005].
• Van Zwol, R. (2004). “Google’s “I’m Feeling Lucky”, truly a gamble?”. In: Web
Information Systems – WISE 2004. Lecture Notes in Computer Science 3306,
378-389.
• Voorhees, E.M. and Harman, D. (2000). “Overview of the sixth Text Retrieval
Conference (TREC-6)”, Information Processing and Management, 36(1), 3-36.
• Zorn, P., Emanoil, M., Marshall, L. and Panek, M. (1996). “Advanced Web
searching: tricks of the trade”, Online, 20(3), 15-28.
Appendix A Google Simple Search Interface
Google Advanced Search Interface
AltaVista Simple Search Interface
AltaVista Advanced Search Interface
Yahoo Simple Search Interface
Yahoo Advanced Search Interface
Appendix B
TREC-9

Topic 1
<num> Number: 451
<title> What is a Bengals cat?
<desc> Description: Provide information on the Bengal cat breed.
<narr> Narrative: Item should include any information on the Bengal cat breed, including description, origin, characteristics, breeding program, names of breeders and catteries carrying bengals. References which discuss bengal clubs only are not relevant. Discussions of bengal tigers are not relevant.

Topic 2
<num> Number: 457
<title> Chevrolet Trucks
<desc> Description: Find documents that address the types of Chevrolet trucks available.
<narr> Narrative: Relevant documents must contain information such as: the length, weight, cargo size, wheelbase, horsepower, cost, etc.

Topic 3
<num> Number: 458
<title> fasting
<desc> Description: Find documents that discuss fasting for religious reasons.
<narr> Narrative: A relevant document discusses fasting as related to periods of religious significance. Relevant documents should state the reason for fasting and the benefits to be derived.

Topic 4
<num> Number: 461
<title> lava lamps
<desc> Description: Find documents that discuss the origin or operation of lava lamps.
<narr> Narrative: A relevant document must contain information on the origin or the operation of the lava lamp.

Topic 5
<num> Number: 463
<title> tartin
<desc> Description: Find information on Scottish tartans: their history, current use, how they are made, and how to wear them.
<narr> Narrative: Simple listings of clan/tartan names or price lists are not relevant. Pictures or descriptions of individual plaids are not relevant unless accompanied by history of their development.

Topic 6
<num> Number: 465
<title> deer
<desc> Description: What kinds of diseases can infect humans due to contact with deer or consumption of deer meat?
<narr> Narrative: Documents explaining the transference of Lyme disease to humans from deer ticks are relevant.

Topic 7
<num> Number: 468
<title> incandescent light bulb
<desc> Description: Find documents that address the history of the incandescent light bulb.
<narr> Narrative: A relevant document must provide information on who worked on the development of the incandescent light bulb. Relevant documents should include locations and dates of the development efforts. Documents that discuss unsuccessful development attempts and non-commercial use of incandescent light bulbs are considered relevant.

Topic 8
<num> Number: 471
<title> mexican food culture
<desc> Description: Find documents that discuss the popularity or appeal of Mexican food outside of the United States.
<narr> Narrative: Documents that discuss the popularity of Mexican food in the United States, Central and South America are not relevant. Relevant documents discuss the extent to which Mexican food is enjoyed or used in Europe, Asia, Africa, or Australia.

Topic 9
<num> Number: 476
<title> Jennifer Aniston
<desc> Description: Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
<narr> Narrative: Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.

Topic 10
<num> Number: 482
<title> where can i find growth rates for the pine tree?
<desc> Description: Find documents that give growth rates of pine trees.
<narr> Narrative: Document that give heights of trees but not the rate of growth are not relevant.

Topic 11
<num> Number: 484
<title> auto skoda
<desc> Description: Skoda is a heavy industrial complex in Czechoslovakia. Does it manufacture vehicles?
<narr> Narrative: Relevant documents would include references to historic and contemporary automobile and truck production. Non-relevant documents would pertain to armament production.
Topic 12
<num> Number: 494
<title> nirvana
<desc> Description: Find information on members of the rock group Nirvana.
<narr> Narrative: Descriptions of members' behavior at various concerts and their performing style is relevant. Information on who wrote certain songs or a band member's role in producing a song is relevant. Biographical information on members is also relevant.

Topic 13
<num> Number: 495
<title> Where can I find information on the decade of the 1920's?
<desc> Description: Find information on the decade of the 1920's, known also as the Roaring Twenties.
<narr> Narrative: Information on life or happenings during the 1920's decade anywhere in the world is relevant. Simple dates of birth or death in the 1920's are not relevant unless they have broader significance.

Topic 14
<num> Number: 500
<title> DNA Testing
<desc> Description: This search seeks information on the state of the art of DNA testing; what it is and what its goals are.
<narr> Narrative: Relevant documents may discuss those things which are essentially steps in the DNA testing procedure, such as: sequencing, analysis, fingerprinting, and profiling. Documents that provide descriptions of elaborate scientific DNA testing are not relevant.

TREC-8

Topic 15
<num> Number: 402
<title> behavioral genetics
<desc> Description: What is happening in the field of behavioral genetics, the study of the relative influence of genetic and environmental factors on an individual's behavior or
personality?
<narr> Narrative: Documents describing genetic or environmental factors relating to understanding and preventing substance abuse and addictions are relevant. Documents pertaining to attention deficit disorders tied in with genetics are also relevant, as are genetic disorders affecting hearing or muscles. The genome project is relevant when tied in with behavior disorders (i.e., mood disorders, Alzheimer's disease).

Topic 16
<num> Number: 405
<title> cosmic events
<desc> Description: What unexpected or unexplained cosmic events or celestial phenomena, such as radiation and supernova outbursts or new comets, have been detected?
<narr> Narrative: New theories or new interpretations concerning known celestial objects made as a result of new technology are not relevant.

Topic 17
<num> Number: 408
<title> tropical storms
<desc> Description: What tropical storms (hurricanes and typhoons) have caused significant property damage and loss of life?
<narr> Narrative: The date of the storm, the area affected, and the extent of damage/casualties are all of interest. Documents that describe the damage caused by a tropical storm as "slight", "limited", or "small" are not relevant.

Practice Topic
<num> Number: 417
<title> creativity
<desc> Description: Find ways of measuring creativity.
<narr> Narrative: Relevant items include definitions of creativity, descriptions of characteristics associated with creativity, and factors linked to creativity.
Topic 18
<num> Number: 420
<title> carbon monoxide poisoning
<desc> Description: How widespread is carbon monoxide poisoning on a global scale?
<narr> Narrative: Relevant documents will contain data on what carbon monoxide poisoning is, symptoms, causes, and/or prevention. Advertisements for carbon monoxide protection products or services are not relevant. Discussions of auto emissions and air pollution are not relevant even though they can contain carbon monoxide.

Topic 19
<num> Number: 427
<title> UV damage, eyes
<desc> Description: Find documents that discuss the damage ultraviolet (UV) light from the sun can do to eyes.
<narr> Narrative: A relevant document will discuss diseases that result from exposure of the eyes to UV light, treatments for the damage, and/or education programs that help prevent damage. Documents discussing treatment methods for cataracts and ocular melanoma are relevant even when a specific cause is not mentioned. However, documents that discuss radiation damage from nuclear sources or lasers are not relevant.

Topic 20
<num> Number: 433
<title> Greek, philosophy, stoicism
<desc> Description: Is there contemporary interest in the Greek philosophy of stoicism?
<narr> Narrative: Actual references to the philosophy or philosophers, productions of Greek stoic plays, and new "stoic" artistic productions are all relevant.

Topic 21
<num> Number: 449
<title> antibiotics ineffectiveness
<desc> Description: What has caused the current ineffectiveness of antibiotics against infections and what is the prognosis for new drugs?
<narr> Narrative: To be relevant, a document must discuss the reasons or causes for the ineffectiveness of current antibiotics. Relevant documents may also include efforts by pharmaceutical companies and federal government agencies to find new cures, updating current testing phases, new drugs being tested, and the prognosis for the availability of new and effective antibiotics.

Topic 22
<num> Number: 415
<title> drugs, Golden Triangle
<desc> Description: What is known about drug trafficking in the "Golden Triangle", the area where Burma, Thailand and Laos meet?
<narr> Narrative: A relevant document will discuss drug trafficking in the Golden Triangle, including organizations that produce or distribute the drugs; international efforts to combat the traffic; or the quantities of drugs produced in the area.

Topic 23
<num> Number: 429
<title> Legionnaires' disease
<desc> Description: Identify outbreaks of Legionnaires' disease.
<narr> Narrative: To be relevant, a document must discuss a specific outbreak of Legionnaires' disease. Documents that address prevention of or cures for the disease without citing a specific case are not relevant.

Topic 24
<num> Number: 430
<title> killer bee attacks
<desc> Description: Identify instances of attacks on humans by Africanized (killer) bees.
<narr> Narrative: Relevant documents must cite a specific instance of a human attacked by killer bees. Documents that note migration patterns or report attacks on other animals are not relevant unless they also cite an attack on a human.
TREC-6
Topic 25
<num> Number: 310
<title> Radio Waves and Brain Cancer
<desc> Description: Evidence that radio waves from radio towers or car phones affect brain cancer occurrence.
<narr> Narrative: Persons living near radio towers and more recently persons using car phones have been diagnosed with brain cancer. The argument rages regarding the direct association of one with the other. The incidence of cancer among the groups cited is considered, by some, to be higher than that found in the normal population. A relevant document includes any experiment with animals, statistical study, articles, news items which report on the incidence of brain cancer being higher/lower/same as those persons who live near a radio tower and those using car phones as compared to those in the general population.

Topic 26
<num> Number: 320
<title> Undersea Fiber Optic Cable
<desc> Description: Fiber optic link around the globe (Flag) will be the world's longest undersea fiber optic cable. Who's involved and how extensive is the technology on this system. What problems exist?
<narr> Narrative: Relevant documents will reference companies involved in building the system or the technology needed for such an endeavor. Of relevance also would be information on the link up points of FLAG or landing sites or interconnection with other telecommunication cables. Relevant documents may reference any regulatory problems with the system once constructed. A non-relevant document would contain information on other fiber optic systems currently in place.

Topic 27
<num> Number: 338
<title> Risk of Aspirin
<desc> Description: What adverse effects have people experienced while taking aspirin repeatedly?
<narr> Narrative: A relevant document should identify any adverse effects experienced from the repeated use of aspirin. Possible effects might include intestinal bleeding,
inflammation of the stomach, or various forms of ulcers. The purpose of the individual's repeated aspirin use should also be stated.

Topic 28
<num> Number: 349
<title> Metabolism
<desc> Description: Document will discuss the chemical reactions necessary to keep living cells healthy and/or producing energy.
<narr> Narrative: A relevant document will contain specific information on the catabolic and anabolic reactions of the metabolic process. Relevant information includes, but is not limited to, the reactions occurring in metabolism, biochemical processes (Glycolysis or Krebs cycle for production of energy), and disorders associated with the metabolic rate.

Topic 29
<num> Number: 350
<title> Health and Computer Terminals
<desc> Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
<narr> Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

Topic 30
<num> Number: 307
<title> New Hydroelectric Projects
<desc> Description: Identify hydroelectric projects proposed or under construction by country and location. Detailed description of nature, extent, purpose, problems, and consequences is desirable.
<narr> Narrative: Relevant documents would contain as a minimum a clear statement that a hydroelectric project is planned or construction is under way and the location of the project. Renovation of existing facilities would be judged not relevant
unless plans call for a significant increase in acre-feet or reservoir or a marked change in the environmental impact of the project. Arguments for and against proposed projects are relevant as long as they are supported by specifics, including as a minimum the name or location of the project. A statement that an individual or organization is for or against such projects in general would not be relevant. Proposals or projects underway to dismantle existing facilities or drain existing reservoirs are not relevant, nor are articles reporting a decision to drop a proposed plan.
Topic Number Generic Specific
1 √ 2 √ 3 √ 4 √ 5 √ 6 √ 7 √ 8 √ 9 √ 10 √ 11 √ 12 √ 13 √ 14 √ 15 √ 16 √ 17 √ 18 √ 19 √ 20 √ 21 √ 22 √ 23 √ 24 √ 25 √ 26 √ 27 √ 28 √ 29 √ 30 √
Appendix C
Model Participant Consent Form

Title of Project: Usability Evaluation and Retrieval Performance of 3 web search engines using TERS platform
Name of Researcher: Kessopoulou Eftychia
Participant Identification Number for this project:

Please initial box
1. I confirm that I have read and understand the information sheet dated 20 May 2005 for the above project and have had the opportunity to ask questions.
2. I understand that my participation is voluntary and that I am free to withdraw at any time without giving any reason.
3. I understand that my responses will be anonymised before analysis. I give permission for members of the research team to have access to my anonymised responses.
4. I agree to take part in the above project.

________________________ ________________ ____________________
Name of Participant Date Signature

_________________________ ________________ ____________________
Name of Person taking consent Date Signature
(if different from researcher)

_________________________ ________________ ____________________
Researcher Date Signature
Copies: One copy for the participant and one copy for the Principal Investigator / Supervisor.
Appendix D

Retrieval system identification
The file has the following format: <system name> <TAB> <url> <NEWLINE>

Search engine links loaded on TERS platform:
Google	http://www.google.com
AltaVista	http://www.altavista.com
Yahoo	http://www.yahoo.com

Topic construction (sample of the original .txt file)
The file has the following format: <topic id> <TAB> <title> <TAB> <description> <TAB> <narrative> <TAB> <search phrase> <NEWLINE>

0	Creativity	Find ways of measuring creativity	Relevant items include definitions of creativity, descriptions of characteristics associated with creativity, and factors linked to creativity	measurement of creativity
1	Bengals cat	Provide information on the Bengal cat breed	Item should include any information on the Bengal cat breed, including description, origin, characteristics, breeding program, names of breeders and catteries carrying bengals. References which discuss bengal clubs only are not relevant. Discussions of bengal tigers are not relevant	bengal cat breeding
2	Chevrolet Trucks	Find documents that address the types of Chevrolet trucks available	Relevant documents must contain information such as: the length, weight, cargo size, wheelbase, horsepower, cost, etc	types of Chevrolet trucks

Survey questions
General questions (Pre-session Questionnaire II)

ID	Type	Question	(Low -> High)
1	scaled	What kind of search engine user are you?	Beginner --> Expert
2	scaled	How often do you use the Internet to find information?	Monthly --> Daily
3	scaled	Does your search for information in that case always lead to satisfying results?	Never --> Always
4	scaled	How often do you use the advanced options when you search using the web search engines?	Never --> Always
5 scaled I know what "Boolean search" is. Strongly Disagree --> Strongly Agree
6 scaled I use "Boolean search". Strongly Disagree --> Strongly Agree
7 scaled I know what a "ranked list" is. Strongly Disagree --> Strongly Agree
8 scaled The "advanced" search mode in the web search engines is an effective tool. Strongly Disagree --> Strongly Agree
Survey questions assessed in all three search engines (Post-session Questionnaire)
9 scaled Are you satisfied with the search options (simple/advanced search) offered by the web search engine? Not satisfied --> Satisfied
10 scaled Are you satisfied with the flexibility of the web search engine to assist the user by filling in the search terms? Not satisfied --> Satisfied
11 scaled Are you satisfied when the web search engine refines the search (provide relevant results that narrow down the search)? Not satisfied --> Satisfied
12 scaled Are you satisfied with the searching of phrases in this search engine? Not satisfied --> Satisfied
13 scaled Are you satisfied with the presentation of the results by this search engine? Not satisfied --> Satisfied
14 scaled What do you think about the reading characters on the screen? Not satisfied --> Satisfied
15 scaled What do you think about the organization of the information on the screen? Not satisfied --> Satisfied
16 Boolean Do you prefer the search terms being highlighted and/or in bold in the search results? Yes --> No
17 scaled Do you think that the search terms highlighted or in bold help you to filter out the most useful results? Strongly disagree --> Strongly agree
18 scaled Are you satisfied when there is information helping you to know where you are? Not satisfied --> Satisfied
19 scaled Are you satisfied with the ranking of the results? Not satisfied --> Satisfied
20 scaled Are you satisfied with the relevance of the retrieved documents? Not satisfied --> Satisfied
21 scaled Are you satisfied with the response time of the search engine? Not satisfied --> Satisfied
22 scaled Are you satisfied with using this web search engine for your search as a whole? Not satisfied --> Satisfied
23 scaled Are you satisfied with the design of this web search engine? Not satisfied --> Satisfied
24 scaled Are you satisfied when the web search engine informs you about its progress? (e.g. Results 1 - 10 of about 16,000,000 for air pollution - 0.09 sec) Not satisfied --> Satisfied
25 scaled Are you satisfied with the ease-of-use of the search engine? Not satisfied --> Satisfied
26 open Do you have any additional remarks regarding the usability of the web search engine? - --> -
27 Boolean Did you use the advanced search? Yes --> No
28 open Are you interrupted while participating in the experiment? (e.g. phone rang, connection failed) - --> -
(Search options: Questions 9-12; Result Presentation: Questions 13-19; Relevance: Question 20; Response time: Question 21; Satisfaction Overall: Questions 22-28)

Usability Workload

Participant ID	Topic ID	System ID
1	0	1
1	1	1
1	2	2
2	0	1
2	1	2
2	2	3
3	0	2
3	1	3
3	2	1
4	0	1
4	16	1
4	17	2
5	0	1
5	16	2
5	17	3
6	0	2
6	16	3
6	17	1
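The usability workload rotates topics and search engines across participants so that no engine is systematically favoured by topic order or learning effects. One way such a counterbalanced assignment could be generated is sketched below; the rotation scheme here is purely illustrative and is not claimed to be the exact scheme used by TERS.

```python
def usability_workload(n_participants, topics_per_user, n_systems=3):
    """Assign each participant a block of topics, rotating the system id
    (1..n_systems) per row so systems are counterbalanced across participants."""
    workload = []  # rows of (participant_id, topic_id, system_id)
    for p in range(1, n_participants + 1):
        for t in range(topics_per_user):
            topic_id = ((p - 1) * topics_per_user + t) % 30  # 30 topics in the collection
            system_id = ((p - 1 + t) % n_systems) + 1        # shift the rotation per participant
            workload.append((p, topic_id, system_id))
    return workload

for row in usability_workload(3, 3):
    print(row)
```

With `topics_per_user` equal to the number of systems, every participant uses each search engine exactly once per block, which is the property the workload table above is designed to have.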
Retrieval Performance Workload

Participant ID	Topic ID
1	18
2	4
3	9
4	22
5	24
6	29
7	25
8	17
9	2
10	30
11	12
12	20
13	23
14	28
15	13
16	21
17	1
18	19
19	6
20	8
21	10
22	27
23	15
24	7
25	14
26	3
27	5
28	16
29	11
30	26

Run set-up
The file uploaded had the following format: <system> <TAB> <topic> <TAB> <rank> <TAB> <uri> <NEWLINE>

System	Topic	Rank	Uri
1	18	1	http://www.carbonmonoxidekills.com/
1	18	2	http://www.carbon-monoxide-poisoning.com/
1	18	3	http://www.nlm.nih.gov/medlineplus/carbonmonoxidepoisoning.html
1	18	4	http://www.epa.gov/iaq/pubs/coftsht.html
1	18	5	http://www.emedicinehealth.com/articles/13442-1.asp
1	18	6	http://www.cdc.gov/nceh/airpollution/carbonmonoxide/default.htm
1	18	7	http://www.cdc.gov/nceh/airpollution/carbonmonoxide/checklist.htm
1	18	8	http://www.osha.gov/OshDoc/data_General_Facts/carbonmonoxide-factsheet.pdf
1	18	9	http://www.postgradmed.com/issues/1999/01_99/tomaszewski.htm
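Since the run file is plain tab-separated text with one result per line, it is straightforward to parse programmatically. A minimal sketch of a parser for this format follows; the field names mirror the template above, and the function itself is the author's illustration rather than part of TERS.

```python
def parse_run_line(line):
    """Split one '<system>\t<topic>\t<rank>\t<uri>' record into typed fields."""
    system, topic, rank, uri = line.rstrip("\n").split("\t")
    return {"system": int(system), "topic": int(topic),
            "rank": int(rank), "uri": uri}

record = parse_run_line("1\t18\t1\thttp://www.carbonmonoxidekills.com/")
print(record["rank"], record["uri"])  # 1 http://www.carbonmonoxidekills.com/
```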
Pre-session questionnaire (I)
1. Age: □ below 20 □ 21-25 □ 26-30 □ 31-35 □ over 35
2. Gender: □ Male □ Female
3. You are: □ Undergraduate □ Postgraduate □ Research Student
4. Academic program that you are attending:
5. How many years have you been using web search engines? □ below 1 □ 1 □ 1-2 □ 3-4 □ 5-6 □ 7-8 □ 9-10 □ over 10
6. Which search engine do you use more often? (please choose one option only) □ Google □ Altavista □ Yahoo □ Alltheweb □ EntireWeb □ Search.com
7. How do you alter your search if you are not satisfied with the results you get?
(please choose one option only) □ do nothing/give up □ ask somebody for help □ use the “help” option that the web search engines provide □ change the web search engine □ use the “advanced” search mode □ change the query slightly □ change the query completely
Appendix E
Age
	Frequency	Percent	Valid Percent	Cumulative Percent
21-25	16	53.3	53.3	53.3
26-30	13	43.3	43.3	96.7
over 35	1	3.3	3.3	100.0
Total	30	100.0	100.0

Gender
	Frequency	Percent	Valid Percent	Cumulative Percent
Male	19	63.3	63.3	63.3
Female	11	36.7	36.7	100.0
Total	30	100.0	100.0

Postgraduate/Research Students
	Frequency	Percent	Valid Percent	Cumulative Percent
Postgraduate	26	86.7	86.7	86.7
Research Student	4	13.3	13.3	100.0
Total	30	100.0	100.0

Frequency of internet use to find information
	Frequency	Percent	Valid Percent	Cumulative Percent
5	2	6.7	6.7	6.7
6	6	20.0	20.0	26.7
7	22	73.3	73.3	100.0
Total	30	100.0	100.0
Scale 1-7; High: Daily, Low: Monthly
Search and Satisfying Results
	Frequency	Percent	Valid Percent	Cumulative Percent
3	1	3.3	3.3	3.3
4	2	6.7	6.7	10.0
5	11	36.7	36.7	46.7
6	12	40.0	40.0	86.7
7	4	13.3	13.3	100.0
Total	30	100.0	100.0
Scale 1-7; High: Always, Low: Never

Frequency of Advanced Option Use
	Frequency	Percent	Valid Percent	Cumulative Percent
1	4	13.3	13.3	13.3
2	7	23.3	23.3	36.7
3	3	10.0	10.0	46.7
4	2	6.7	6.7	53.3
5	4	13.3	13.3	66.7
6	7	23.3	23.3	90.0
7	3	10.0	10.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Always, Low: Never

Knowledge of Boolean Search
	Frequency	Percent	Valid Percent	Cumulative Percent
1	16	53.3	53.3	53.3
4	2	6.7	6.7	60.0
5	2	6.7	6.7	66.7
6	3	10.0	10.0	76.7
7	7	23.3	23.3	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree
Use of Boolean Search
	Frequency	Percent	Valid Percent	Cumulative Percent
1	18	60.0	60.0	60.0
3	2	6.7	6.7	66.7
4	2	6.7	6.7	73.3
5	4	13.3	13.3	86.7
6	1	3.3	3.3	90.0
7	3	10.0	10.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree

Knowledge of Ranked List
	Frequency	Percent	Valid Percent	Cumulative Percent
1	10	33.3	33.3	33.3
2	1	3.3	3.3	36.7
3	2	6.7	6.7	43.3
4	2	6.7	6.7	50.0
5	2	6.7	6.7	56.7
6	4	13.3	13.3	70.0
7	9	30.0	30.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree

Utility of Advanced Search
	Frequency	Percent	Valid Percent	Cumulative Percent
3	5	16.7	16.7	16.7
4	7	23.3	23.3	40.0
5	4	13.3	13.3	53.3
6	8	26.7	26.7	80.0
7	6	20.0	20.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree
Survey Questions (Post-session Questionnaire)

Question 9
system id	system name	# of answers	average	std dev
	OVERALL	90	4.96	1.40
3	Yahoo	30	4.96	1.35
1	Google	30	5.50	1.25
2	AltaVista	30	4.43	1.43

Question 10
system id	system name	# of answers	average	std dev
	OVERALL	90	4.92	1.34
1	Google	30	5.46	1.22
2	AltaVista	30	4.43	1.40
3	Yahoo	30	4.86	1.22

Question 11
system id	system name	# of answers	average	std dev
	OVERALL	90	4.94	1.43
3	Yahoo	30	5.30	1.26
1	Google	30	5.33	1.39
2	AltaVista	30	4.20	1.37

Question 12
system id	system name	# of answers	average	std dev
	OVERALL	90	4.86	1.45
1	Google	30	5.53	1.22
2	AltaVista	30	4.10	1.47
3	Yahoo	30	4.96	1.32

Question 13
system id	system name	# of answers	average	std dev
	OVERALL	90	4.87	1.68
3	Yahoo	30	5.36	1.47
1	Google	30	5.46	1.54
2	AltaVista	30	3.80	1.54
Question 14
system id	system name	# of answers	average	std dev
	OVERALL	90	5.13	1.55
1	Google	30	5.86	1.10
2	AltaVista	30	4.23	1.56
3	Yahoo	30	5.30	1.51

Question 15
system id	system name	# of answers	average	std dev
	OVERALL	90	4.93	1.74
3	Yahoo	30	5.16	1.59
1	Google	30	5.53	1.61
2	AltaVista	30	4.10	1.74
Question 16
system id	system name	# of answers	average	std dev
	OVERALL	90	0.82	0.57
1	Google	30	0.86	0.50
2	AltaVista	30	0.80	0.61
3	Yahoo	30	0.80	0.61

Question 17
system id	system name	# of answers	average	std dev
	OVERALL	90	5.32	1.57
3	Yahoo	30	5.16	1.74
1	Google	30	5.66	1.47
2	AltaVista	30	5.13	1.50

Question 18
system id	system name	# of answers	average	std dev
	OVERALL	90	5.56	1.32
1	Google	30	5.73	1.33
2	AltaVista	30	5.50	1.38
3	Yahoo	30	5.46	1.27
Question 19
system id	system name	# of answers	average	std dev
	OVERALL	90	4.54	1.55
3	Yahoo	30	4.70	1.48
1	Google	30	5.03	1.32
2	AltaVista	30	3.90	1.64

Question 20
system id	system name	# of answers	average	std dev
	OVERALL	90	4.54	1.45
1	Google	30	4.93	1.33
2	AltaVista	30	4.03	1.49
3	Yahoo	30	4.66	1.42

Question 21
system id	system name	# of answers	average	std dev
	OVERALL	90	5.67	1.39
3	Yahoo	30	5.40	1.49
1	Google	30	6.33	0.92
2	AltaVista	30	5.30	1.48
Question 22
system id	system name	# of answers	average	std dev
	OVERALL	90	4.77	1.71
1	Google	30	5.83	1.34
2	AltaVista	30	3.76	1.79
3	Yahoo	30	4.73	1.33

Question 23
system id	system name	# of answers	average	std dev
	OVERALL	90	4.95	1.66
3	Yahoo	30	5.00	1.38
1	Google	30	6.00	1.14
2	AltaVista	30	3.86	1.69

Question 24
system id	system name	# of answers	average	std dev
	OVERALL	90	5.14	1.59
1	Google	30	5.43	1.56
2	AltaVista	30	4.83	1.70
3	Yahoo	30	5.16	1.51
Question 25
system id	system name	# of answers	average	std dev
	OVERALL	90	5.12	1.54
3	Yahoo	30	5.16	1.39
1	Google	30	6.16	0.87
2	AltaVista	30	4.03	1.49