
THE HUMAN-COMPUTER INTERACTION HANDBOOK
Fundamentals, Evolving Technologies and Emerging Applications

JULIE A. JACKO, Editor, Georgia Institute of Technology
ANDREW SEARS, Editor, UMBC

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
2003, Mahwah, New Jersey; London


56
USER-BASED EVALUATIONS

Joseph S. Dumas
Oracle Corporation

Introduction
User-Administered Questionnaires
    Off-the-Shelf Questionnaires
Observing Users
Empirical Usability Testing
    The Focus Is on Usability
    The Participants Are End Users or Potential End Users
    There Is a Product or System to Evaluate
    The Participants Think Aloud As They Perform Tasks
    The Participants Are Observed, and Data Are Recorded and Analyzed
    Measures and Data Analysis
    Communicating Test Results
Variations on the Essentials
    Measuring and Comparing Usability
    Comparing the Usability of Products
    Baseline Usability Tests
    Allowing Free Exploration
Challenges to the Validity of Usability Testing
    How Do We Evaluate Usability Testing?
    Why Can't We Map Usability Measures to User Interface Components?
    Are We Ignoring the Operational Environment?
    Why Don't Usability Specialists See the Same Usability Problems?
Additional Issues
    How Do We Evaluate Ease of Use?
    How Does Usability Testing Compare With Other Evaluation Methods?
    Is It Time to Standardize Methods?
    Are There Ethical Issues in User Testing?
    Is Testing Web-Based Products Different?
    The Future of Usability Testing
Which User-Based Method to Use?
References


INTRODUCTION

Over the past 20 years, there has been a revolution in the way products, especially high-tech products, are developed. It is no longer accepted practice to wait until the end of development to evaluate a product. That revolution applies to evaluating usability. As the other chapters in this handbook show, evaluation and design now are integrated. Prototyping software and the acceptance of paper prototyping make it possible to evaluate designs as early concepts, then throughout the detailed design phases. User participation is no longer postponed until just before the product is in its final form. Early user involvement has blurred the distinction between design and evaluation. Brief usability tests are often part of participatory design sessions, and users are sometimes asked to participate in early user interface design walkthroughs. Although the focus of this chapter is on user-based evaluation methods, I concede that the boundary between design methods and evaluation methods grows less distinct with time.

In this chapter, I focus on user-based evaluations, which are evaluations in which users directly participate. But the boundary between user-based and other methods is also becoming less distinct. Occasionally, usability inspection methods and user-based methods merge, such as in the pluralistic walkthrough (Bias, 1994). In this chapter, I maintain the somewhat artificial distinction between user-based and other evaluation methods to treat user-based evaluations thoroughly. I describe three user-based methods: user-administered questionnaires, observing users, and empirical usability testing. In the final section of the chapter, I describe when to use each method.

USER-ADMINISTERED QUESTIONNAIRES

A questionnaire can be used as a stand-alone measure of usability, or it can be used along with other measures. For example, a questionnaire can be used at the end of a usability test to measure the subjective reactions of the participant to the product tested, or it can be used as a stand-alone usability measure of the product. Over the past 20 years, there have been questionnaires that

• Measure attitudes toward individual products

• Break attitudes down into several smaller components, such as ease of learning

• Measure just one aspect of usability (Spenkelink, Beuijen, & Brok, 1993)

• Measure attitudes that are restricted to a particular technology, such as computer software

• Measure more general attitudes toward technology or computers (Igbaria & Parasuraman, 1991)

• Are filled out after using a product only once (Doll & Torkzadeh, 1988)

• Assume repeated use of a product

• Require a psychometrician for interpretation of results (Kirakowski, 1996)

• Come with published validation studies

• Provide comparison norms to which one can compare results

Throughout this history, there have been two objectives for questionnaires developed to measure usability: (a) create a short questionnaire to measure users' subjective evaluation of a product, usually as part of another evaluation method, and (b) create a questionnaire to provide an absolute measure of the subjective usability of a product. This second objective parallels the effort in usability testing to find an absolute measure of usability, that is, a numerical measure of the usability of a product that is independent of its relationship to any other product.

Creating a valid and reliable questionnaire to evaluate usability takes considerable effort and specialized skills, skills in which most usability professionals don't receive training. The steps involved in creating an effective questionnaire include the following:

• Create a number of questions or ratings that appear to tap attitudes or opinions that you want to measure. For example, the questions might focus on a product's overall ease of use. At the beginning of the process, the more questions you can create, the better.

• Use item analysis techniques to eliminate the poor questions and keep the effective ones. For example, if you asked a sample of users to use a product and then answer the questions, you could compute the correlation between each question and the total score of all of the questions. You would eliminate questions with low correlations. You would also eliminate questions with small variances because nearly all of the respondents are selecting the same rating value or answer. You would also look for high correlations between two questions because this indicates that the questions may be measuring the same thing. You could then eliminate one of the two. (A code sketch of these computations follows this list.)

• Assess the reliability of the questionnaire. For example, you could measure test-retest reliability by administering the questionnaire twice to the same respondents, but far enough apart in time that respondents would be unlikely to remember their answers from the first time. You could also measure split-half reliability by randomly assigning each question to one of two sets of questions, then administering both sets and computing the correlation between them (Gage & Berliner, 1991).

• Assess the validity of the questionnaire. Validity is the most difficult aspect to measure but is an essential characteristic of a questionnaire (Chignell, 1990). A questionnaire is valid when it measures what it is supposed to measure, so a questionnaire created to measure the usability of a product should do just that. Demonstrating that it is valid takes some ingenuity. For example, if the questionnaire is applied to two products that are known to differ on usability, the test scores should reflect that difference. Or test scores from users should correlate with usability judgments of experts about a product. If the correlations are low, either the test is not valid or the users and the experts are not using the same process. Finally, if the test is valid, it should correlate highly with questionnaires with known validity, such as the usability questionnaires discussed in the following subsections.
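The item-analysis and split-half computations described above lend themselves to a short sketch. The example below is a minimal illustration in plain Python; the response matrix (one row per respondent, one 1-to-5 rating per question), the helper names, and any cutoff values are invented for illustration and are not prescribed by the chapter.

```python
# Illustrative item analysis: rows are respondents, columns are questions (1-5 ratings).
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_total_correlations(responses):
    """Correlate each question with the total score of all questions;
    low values flag questions that are candidates for deletion."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    return [pearson([row[i] for row in responses], totals) for i in range(n_items)]

def split_half_reliability(responses):
    """Correlate the two half-test totals (here, odd vs. even items) and
    apply the Spearman-Brown correction to estimate full-test reliability."""
    odd = [sum(row[::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r = pearson(odd, even)
    return 2 * r / (1 + r) if r != -1 else 0.0

# Made-up ratings from five respondents on four questions.
data = [[4, 5, 4, 4], [2, 2, 3, 2], [5, 4, 5, 5], [3, 3, 3, 4], [1, 2, 2, 1]]
print(item_total_correlations(data))
print([pstdev([row[i] for row in data]) for i in range(4)])  # small-variance check
print(split_half_reliability(data))
```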


Off-the-Shelf Questionnaires

Because an effective questionnaire takes time and special skills to develop, usability specialists have been interested in using off-the-shelf questionnaires that they can borrow or purchase. The advantages of using a professionally developed questionnaire are substantial. These questionnaires usually have been developed by measurement specialists who assess the validity and reliability of the instrument as well as the contribution of each question.

Historically, there have been two types of questionnaires developed: (a) short questionnaires that can be used to obtain a quick measure of users' subjective reactions, usually to a product that they have just used for the first time, and (b) longer questionnaires that can be used alone as an evaluation method and that may be broken out into more specific subscales.

Short Questionnaires. There have been a number of published short questionnaires. A three-item questionnaire was developed by Lewis (1991). The three questions measure the users' judgment of how easily and quickly tasks were completed. The System Usability Scale (SUS) has 10 questions (Brooke, 1996). It can be used as a stand-alone evaluation or as part of a user test. It can be applied to any product, not just software. It was created by a group of professionals then working at Digital Equipment Corporation. The 10 SUS questions have a Likert scale format: a statement followed by a five-level agreement scale. For example,

I think that I would like to use this system frequently.
Strongly Disagree  1  2  3  4  5  Strongly Agree

Brooke (1996) described the scale and the scoring system, which yields a single, 100-point scale.
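As an illustration, the sketch below applies the commonly published SUS scoring rule (odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5 to reach the 100-point value); the example ratings are invented.

```python
def sus_score(ratings):
    """Convert ten SUS ratings (1-5, in questionnaire order) to a 0-100 score,
    following the commonly published scoring rule (Brooke, 1996)."""
    if len(ratings) != 10:
        raise ValueError("SUS has exactly 10 items")
    contributions = []
    for i, r in enumerate(ratings, start=1):
        # Odd-numbered items are worded positively, even-numbered items negatively.
        contributions.append(r - 1 if i % 2 == 1 else 5 - r)
    return sum(contributions) * 2.5

# Example: invented ratings from one test participant.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```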

A somewhat longer questionnaire is the Computer User Satisfaction Inventory (CUSI). It was developed to measure attitudes toward software applications (Kirakowski & Corbett, 1988). It has 22 questions that break into two subscales: affect (the degree to which respondents like the software) and competence (the degree to which respondents feel they can complete tasks with the product).

Stand-Alone Questionnaires. These questionnaires were developed to measure usability as a stand-alone method. They have many questions and attempt to break users' attitudes into a number of subscales. The Questionnaire for User Interaction Satisfaction (QUIS) was developed at the Human-Computer Interaction Lab (HCIL) at the University of Maryland at College Park (Chin, Diehl, & Norman, 1988). QUIS was designed to assess users' subjective satisfaction with several aspects of the human-computer interface. It has been used by many evaluators over the past 10 years, in part because of its inclusion in Shneiderman's (1997) editions. It consists of a set of general questions, which provide an overall assessment of a product, and a set of detailed questions about interface components. Version 7.0 of the questionnaire contains a set of demographic questions, a measure of overall system satisfaction, and hierarchically organized measures of 11 specific interface factors: screen factors, terminology and system feedback, learning factors, system capabilities, technical manuals, online tutorials, multimedia, voice recognition, virtual environments, Internet access, and software installation.

Because QUIS's factors are not always relevant to every product, practitioners often select a subset of the questions to use or use only the general questions. There is a long form of QUIS (71 questions) and a short form (26 questions). Each question uses a 9-point rating scale, with the end points labeled with adjectives. For example,

Characters on the screen are:
Hard to read  1  2  3  4  5  6  7  8  9  Easy to read

There is a Web site for QUIS (www.lap.umd.edu/QUIS/index.html). Licenses for use are available for a few hundred dollars. The site also contains references to evaluations that have used QUIS.

The Software Usability Measurement Inventory (SUMI) was developed to evaluate software only (Kirakowski, 1996). It is a well-constructed instrument that breaks the answers into six subscales: global, efficiency, affect, helpfulness, control, and learnability.

The Global subscale is similar to QUIS's general questions. The SUMI questionnaire consists of 50 statements to which users reply that they either agree, are undecided, or disagree. For example:

• This software responds too slowly to inputs.

• The instructions and prompts are helpful.

• The way that system information is presented is clear and understandable.

• I would not like to use this software every day.

Despite its length, SUMI can be completed in about 5 minutes. It does assume that the respondents have had several sessions working with the software. SUMI has been applied not only to new software under development, but also to compare software products and to establish a usability baseline. SUMI has been used in development environments to set quantitative goals, track achievement of goals during product development, and highlight good and bad aspects of a product.

SUMI's strengths come from its thorough development. Its validity and reliability have been established. In addition, its developers have created norms for the subscales so that you can compare your software against similar products. For example, you could show that the product you are evaluating scored higher than similar products on all of the subscales. The norms come from several thousand respondents.

The questionnaire comes with a manual for scoring the questions and using the norms. The developers recommend that the test be scored by a trained psychometrician. For a fee, the developer will do the scoring and the comparison with norms. The license comes with 50 questionnaires in the language of your choice, a manual, and software for scoring the results and creating reports. The Web site for SUMI is http://www.ucc.ie/hfrg/questionnaires/sumi/index.html.

Questionnaires can play an important role in a toolkit of usability evaluation methods. It is difficult to create a good one, but there are several that have been well constructed and extensively used. The short ones can be used as part of other evaluation methods, and for most usability specialists, using them is preferable to creating their own. The longer ones can be used to establish a usability baseline and to track progress over time. Even the longest questionnaires can be completed in 10 minutes or less. Whether any of these questionnaires can provide an absolute measure of usability remains to be demonstrated.

OBSERVING USERS

Although observing users is a component of many evaluation methods, such as watching users through a one-way mirror during a usability test, this section focuses on observation as a stand-alone evaluation method. Some products can only be evaluated in their use environment, where the most an evaluator can do is watch the participants. Indeed, one could evaluate any product by observing its use and recording what happens. For example, if you were evaluating new software for stock trading, you could implement it and then watch trading activity as it occurs.

Unfortunately, observation has several limitations when used alone (Baber & Stanton, 1996), including the following:

• It is difficult to infer causality while observing any behavior. Because the observer is not manipulating the events that occur, it is not always clear what caused a behavior.

• The observer is unable to control when events occur. Hence, important events may never occur while the observer is watching. A corollary to this limitation is that it may take a long time to observe what you are looking for.

• Participants change their behavior when they know they are being observed. This problem is not unique to observation; in fact it is a problem with any user-based evaluation method.

• Observers often see what they want to see, which is a direct challenge to the validity of observation.

Baber and Stanton provide guidelines for using observation as an evaluation method.

A method related to both observation and user testing is private camera conversation (DeVries, Hartevelt, & Oosterholt, 1996). Its advocates claim that participants enjoy this method and that it yields a great deal of useful data. The method requires only a private room and a video camera with a microphone. It can be implemented in a closed booth at a professional meeting, for example. The participant is given a product and asked to go into the room and, when ready, turn on the camera and talk. The instructions on what to talk about are quite general, such as asking them to talk about what they like and dislike about the product. The sessions are self-paced but quite short (5-10 minutes). As with usability testing, the richness of the verbal protocol is enhanced when two or more people who know each other participate together.

The product of the sessions is a videotape that must be watched and analyzed. Because the participants are allowed to be creative and do not have to follow a session protocol, it is difficult to evaluate the usability of a product with this method.

A related method has been described by Bauersfeld and Halgren (1996). The rationale behind this passive video observation is the assumption that a video camera can be less intrusive than a usability specialist but more vigilant. In this method, a video camera is set up in the user's work environment. There is a second camera or a scan converter that shows what is on the user's screen or desk surface. The two images are mixed and recorded. Participants are told to ignore the cameras as much as possible and to work as they normally would. Participants are shown how to turn on the equipment and told to do so whenever they work. This method can be used during any stage of product development and not just for evaluation.

Although passive video capture is done without a usability specialist present, we still don't know whether participants act differently because they know they are being taped. In addition, the data must be extracted from the videotapes, which takes as much time as observing the participant directly. Still, this method can be used in situations in which an observer can't be present when users are working.

EMPIRICAL USABILITY TESTING

Usability testing began in the early 1980s at a time when computer software was beginning to reach a wider audience than just computing professionals. The explosion of end user computing was made possible by new hardware and software in the form of both the mini- and microcomputer and expansion of communications technology, which moved computing from the isolated computer room to the desktop. The advent of the cathode ray tube (CRT) and communications technology made it possible to interact directly with the computer in real time.

The 1982 conference, Human Factors in Computer Systems, held at Gaithersburg, Maryland, brought together for the first time professionals interested in studying and understanding human-computer interaction. Subsequent meetings of this group became known as the Computer-Human Interaction (CHI) Conference. At that first meeting, there was a session on evaluating text editors that described early usability tests (Ledgard, 1982). The reports of these studies were written in the style of experimental psychology reports, including sections titled "Experimental Design" and "Data Analysis," in which the computation of inferential statistics was described.

But the reliance on psychological research experiments as a model for usability testing was challenged early. Young and Barnard (1987) proposed the concept of scenarios instead of experiments, and 2 years later, CHI Conference writers were discussing issues such as "The Role of Laboratory Experiments in HCI: Help, Hindrance or Ho-Hum?" (Wolf, 1989).

The first books on HCI began appearing at this time. Perhaps the most influential book on usability, Shneiderman's (1987) first edition of Designing the User Interface, did not have a section or index item for usability testing but did have one on quantitative evaluations. In that section, Shneiderman wrote the following:

Scientific and engineering progress is often stimulated by improved techniques for precise measurement. Rapid progress in interactive systems design will occur as soon as researchers and practitioners evolve suitable human performance measures and techniques.... Academic and industrial researchers are discovering that the power of traditional scientific methods can be fruitfully employed in studying interactive systems. (p. 411)

In the 1992 edition, there again was no entry in the index for usability testing, but there was one for usability laboratories. Shneiderman described usability tests but called them "pilot tests." These tests "can be run to compare design alternatives, to contrast the new system with current manual procedures, or to evaluate competitive products" (p. 479).

In the 1997 edition, there is a chapter section on usability testing and laboratories. Shneiderman wrote:

Usability-laboratory advocates split from their academic roots as these practitioners developed innovative approaches that were influenced by advertising and market research. While academics were developing controlled experiments to test hypotheses and support theories, practitioners developed usability-testing methods to refine user interfaces rapidly. (p. 128)

This brief history shows that usability testing has been an established evaluation method for only about 10 years. The research studies by Virzi (1990, 1992) on the relatively small number of participants needed in a usability test gave legitimacy to the notion that a usability test could identify usability problems quickly. Both of the book-length descriptions of usability testing (Dumas & Redish, 1993; Rubin, 1994) explicitly presented usability testing as a method separate from psychological research. Yet, as discussed later in this chapter, comparisons between usability testing and research continue. The remaining sections on usability testing cover usability testing basics, important variations on the essentials, challenges to the validity of user testing, and additional issues.

Valid usability tests have the following six characteristics.

• The focus is on usability.
• The participants are end users or potential end users.
• There is some artifact to evaluate, such as a product design, a system, or a prototype of either.
• The participants think aloud as they perform tasks.
• The data are recorded and analyzed.
• The results of the test are communicated to appropriate audiences.

The Focus Is on Usability

It may seem like an obvious point that a usability test should be about usability, but sometimes people try to use a test for other, inappropriate purposes or call other methods usability tests. Perhaps the most common mismatch is between usability and marketing and promotional issues, such as adding a question to a posttest questionnaire asking participants if they would buy the product they just used. If the purpose of the question is to provide an opportunity for the participant to talk about his or her reactions to the test session, the question is appropriate. But if the question is added to see if customers would buy the product, the question is not appropriate. A six-participant usability test is not an appropriate method for estimating sales or market share. Obviously, a company would not base its sales projections on the results of such a question, but people who read the test report may draw inappropriate conclusions, for example, when the product has several severe usability problems, but five of the six participants say that they would buy it. The participants' answers could provide an excuse for ignoring the usability problems. It is best not to include such questions or related ones about whether customers would use the manual.

The other common misconception about the purpose of a test is to view it as a research experiment. The fact is, a usability test looks like research. It often is done in a "lab," and watching participants think out loud fits a stereotype some people have about what a research study looks like. But a usability test is not a research study (Dumas, 1999).

A Usability Test Is Not a Focus Group. Usability testing sometimes is mistaken for a focus group, perhaps the most used and abused empirical method of all time. People new to user-based evaluation jump to the conclusion that talking with users during a test is like talking with participants in a focus group. But a usability test is not a group technique, although two participants are sometimes paired, and a focus group is not a usability test unless it contains the six essential components of a test. The two components of a usability test that are most often missing from a focus group are (a) a primary emphasis on usability and (b) participants performing tasks during the session.

The most common objective for a usability test is the diagnosis of usability problems. When testers use the term usability test with no qualifier, most often they are referring to a diagnostic test. When the test has another purpose, it has a qualifier such as comparison or baseline.

When Informal Really Means Invalid. One of the difficulties in discussing usability testing is finding a way to describe a test that is somewhat different from a complete diagnostic usability test. A word that is often used to qualify a test is informal, but it is difficult to know what informal really means. Thomas (1996) described a method, called "quick and dirty" and "informal," in which the participants are not intended users of the product and in which time and other measures of efficiency are not recorded. Such a test may be informal in some sense of that word, but it is certainly invalid and should not be called a usability test. It is missing one of the essentials: potential users. It is not an informal usability test because it is not a usability test at all. Still, we need words to describe diagnostic tests that differ from each other in important ways. In addition, tests that are performed quickly and with minimal resources are best called "quick and clean" rather than "informal" or "quick and dirty" (Wichansky, 2000).


The Participants Are End Users or Potential End Users

A valid usability test must test people who are part of the target market for the product. Testing with other populations may be useful, that is, it may find usability problems. But the results cannot be generalized to the relevant population: the people for whom the product is intended.

The key to finding people who are potential candidates for the test is a user profile (Branaghan, 1997). In developing a profile of users, testers want to capture two types of characteristics: those that the users share and those that might make a difference among users. For example, in a test of an upgrade to a design for a cellular phone, participants could be people who now own a cell phone or who would consider buying one. Of the people who own a phone, you may want to include people who owned the previous version of the manufacturer's phone and people who own other manufacturers' phones. These characteristics build a user profile. It is from that profile that you create a recruiting screener to select the participants.
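As a sketch of how a user profile turns into a recruiting screener, the snippet below encodes the cell-phone example just described; the field names, group labels, and qualification logic are hypothetical illustrations, not a prescribed screener.

```python
# Hypothetical screener for the cell-phone upgrade example: candidates must own
# a cell phone or be willing to buy one, and owners are sorted into subgroups
# by whether they owned the previous version of the manufacturer's phone.

def screen(candidate):
    """Return a recruiting group name, or None if the candidate is excluded."""
    owns = candidate.get("owns_cell_phone", False)
    would_buy = candidate.get("would_consider_buying", False)
    if not owns and not would_buy:
        return None  # outside the target market
    if not owns:
        return "prospective buyer"
    if candidate.get("owned_previous_model", False):
        return "previous-model owner"
    return "other-manufacturer owner"

print(screen({"owns_cell_phone": True, "owned_previous_model": True}))
print(screen({"owns_cell_phone": False, "would_consider_buying": False}))
```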

A common issue at this stage of planning is that there are more relevant groups to test than there are resources to test them. This situation forces the test team to decide on which group or groups to focus. This decision should be based on the product management's priorities, not on how easy it might be to recruit participants. There is almost always a way to find the few people needed for a valid usability test.

A Small Sample Size Is Still the Norm. The fact that usability testing uncovers usability problems quickly remains one of its most compelling properties. Testers know from experience that in a diagnostic test, the sessions begin to get repetitive after running about five participants in a group. The early research studies by Virzi (1990, 1992; see Fig. 56.1) showing that 80% of the problems are uncovered with about five participants and 90% with about 10 continue to be confirmed (Law & Vanderheiden, 2000).

What these studies mean for practitioners is that, given a sample of tasks and a sample of participants, just about all of the problems testers will find appear with the first 5 to 10 participants. This research does not mean that all of the possible problems with a product appear with 5 or 10 participants, but most of the problems that are going to show up with one sample of tasks and one group of participants will occur early.
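The idealized curve in Fig. 56.1 is commonly modeled by the cumulative problem-discovery formula associated with Virzi's studies, proportion found = 1 - (1 - p)^n, where p is the average probability that a single participant uncovers a given problem. The sketch below assumes an illustrative p of 0.3; the actual value varies with the product, the tasks, and the participants.

```python
def proportion_found(n_participants, p=0.3):
    """Expected proportion of problems uncovered by n participants, assuming
    each participant independently finds a given problem with probability p."""
    return 1 - (1 - p) ** n_participants

for n in (1, 5, 10, 15):
    print(n, round(proportion_found(n), 2))
# With p = 0.3: about 0.83 after 5 participants and about 0.97 after 10.
```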

There are some studies that do not support the finding that small samples quickly converge on the same problems. Lewis (1994) found that for a very large product, a suite of office productivity tools, 5 to 10 participants was not enough to find nearly all of the problems. The studies by Molich et al. (1998, 2001) also do not favor convergence on a common set of problems.

As I discuss later, the issue of how well usability testing uncovers the most severe usability problems is clouded by the unreliability of severity judgments.

Recruiting Participants and Getting Them to Show Up. To run the test you plan, you will need to find candidates and qualify them for inclusion in the test. Usually, there are inclusion and exclusion criteria. For example, from a user profile for a test of an instruction sheet that accompanies a ground fault circuit interrupter (GFCI), the kind of plug installed in a bathroom or near a swimming pool, a test team might want to include people who consider themselves "do-it-yourselfers" and who would be willing to attempt the installation of the GFCI but exclude people who actually had installed one before or who were licensed electricians. The way the testers would qualify candidates is to create a screening questionnaire containing the specific questions to use to qualify each candidate.

[FIGURE 56.1. An idealized curve showing the number of participants needed to find various proportions of usability problems. The vertical axis plots the proportion of problems uncovered (0.0 to 1.0); the horizontal axis plots the number of participants in the test (0 to 20).]


Then they have to find candidates to recruit. It often takes some social skills and a lot of persistence to recruit for a usability test. It takes a full day to recruit about six participants. For the test of the GFCI instruction sheet, they may have to go to a hardware store and approach people who are buying electrical equipment to find the relevant "do-it-yourselfers."

Many organizations use recruiting firms to find test participants. Firms charge about $100 (in year 2000 U.S. dollars) for each recruited participant. But the testers create the screening questions and test them to see if the people who qualify fit the user profile.

To get participants to show up, the testers or the recruiting firm need to do the following:

• Be enthusiastic with them on the phone.
• Offer them some incentive. Nothing works better than money, about $50 to $75 an hour (in year 2000 dollars) for participants without any unusual qualifications. Some testing organizations use gift certificates or free products as incentives. For participants with unusual qualifications, such as anesthesiologists or computer network managers, the recruiter may need to emphasize what the candidates are contributing to their profession by participating in the test.
• As soon as participants are qualified, send, fax, or e-mail a confirmation citing the particulars discussed on the phone and a map with instructions for getting to the site.
• Contact participants one or two days before the test as a reminder.
• Give participants a phone number to call if they can't make the session, need to reschedule, or will be late.

If testers follow all of these steps, they will still have a no-show rate of about 10%. Some organizations over-recruit for a test, qualifying some extra candidates to be backups in case another participant is a no-show. A useful strategy can be to recruit two participants for a session and, if both show up, run a codiscovery session with both participants. See below for a description of codiscovery.

There Is a Product or System to Evaluate

Usability testing can be performed with most any technology. The range includes the following:

• Products with user interfaces that are all software (e.g., a database management system), all hardware (a high-quality pen), and those that are both (a cell phone, a clock radio, a hospital patient monitor, a circuit board tester, etc.)

• Products intended for different types of users (such as consumers, medical personnel, engineers, network managers, high school students, computer programmers, etc.)

• Products that are used together by groups of users, such as cooperative work software (Scholtz & Bouchette, 1995)

• Products in various stages of development (such as user-interface concept drawings; early, low-tech prototypes; more fully functioning, high-fidelity prototypes; products in beta testing; and completed products)

• Components that are embedded in or accompany a product (such as print manuals, instruction sheets that are packaged with a product, tutorials, quick-start programs, online help, etc.)

Testing Methods Work Even With Prototypes. One of the major advances in human-computer interaction over the last 15 years is the use of prototypes to evaluate user interface designs. The confidence that evaluators have in the validity of prototypes has made it possible to move evaluation sooner and sooner in the development process. Evaluating prototypes has been facilitated by two developments: (a) using paper prototypes and (b) using software specifically developed for prototyping. Ten or more years ago, usability specialists wanted to create prototyping tools that make it possible to save the code from the prototype to use it in the final product. Developers soon realized that a prototyped version of a design is seldom so close to the final design that it is worth saving the code. Consequently, the focus of new prototyping tools has been on speed of creating an interactive design. In addition, the speed with which these software tools and paper prototypes can be created makes it possible to evaluate user interface concepts before the development team gets so enamored with a concept that they won't discard it.

There have been several studies that have looked at the validity of user testing using prototypes. These studies compare paper or relatively rough drawings with more interactive and polished renderings. The studies all show that there are few differences between high- and low-fidelity prototypes in terms of the number or types of problems identified in a usability test or in the ratings of usability that participants give to the designs (Cantani & Biers, 1998; Landay & Myers, 1995; Virzi, Sokolov, & Karis, 1996; Wiklund, Dumas, & Thurrot, 1992).

The Participants Think Aloud As They Perform Tasks

This is the execution phase of the test. It is where the test participant and the test administrator interact, and it is where the data are collected. Before the test session starts, the administrator gives a set of pretest instructions. The instructions tell the participant how the test will proceed and that the test probes the usability of the product, not their skills or experience.

One of the important parts of the pretest activities is the instructions on thinking aloud. The administrator tells the participants to say out loud what they are experiencing as they work. Interest in thinking aloud was revived with the recent publication of two articles that were written independently at almost the same time (Boren & Ramey, 2000; Dumas, 2001). Both of these articles question assumptions about the similarity between the think-aloud method as used in usability testing and the think-aloud method used in cognitive psychology research. Until these articles were published, most discussions of the think-aloud method used in usability testing automatically noted its superficial resemblance to the method described by Ericsson and Simon (1993) to study human problem solving.


In both methods, the participants are taught to think aloud by providing instructions on how to do it, showing an example of the think aloud by giving a brief demonstration of it, and by having the participant practice thinking aloud. But there ends the similarity to research.

In cognitive psychology research, thinking aloud is used to study what is in participants' short-term memory. Called Level 1 thinking aloud, the research method focuses on having the participant say out loud what is in the participant's short-term memory, which can only occur when the participants describe what they are thinking as they perform cognitive tasks such as multiplying two numbers. Participants are discouraged from reporting any interpretations of what is happening, any emotions that accompany the task, and their expectations or violations of them. The research method is thought, by its proponents, to provide entree only into the short-term memory of the participants in the research.

In usability testing, the focus is on interactions with the object being tested and on reporting not only thoughts, but also expectations, feelings, and whatever the participants want to report. Reports of experiences other than thoughts are important because they often are indicators of usability problems.

This discrepancy between the two sets of think-aloud instructions led Boren and Ramey (2000) to look at how thinking aloud is used in user testing as well as practices related to thinking aloud, such as how and when to encourage participants to continue to do it. They reported the results of observing test administrators implementing these practices and how little consistency there is among them. Boren and Ramey explored other aspects of the verbal communication between administrators and participants, including how to keep the participants talking without interfering with the think-aloud process.

Bowers and Snyder (1990) conducted a research study to compare the advantages and disadvantages of having test participants think out loud as they work, called concurrent thinking aloud, with thinking out loud after the session, called retrospective thinking aloud. In the retrospective condition, the participants performed tasks in silence then watched a videotape of the session while they thought aloud. This is an interesting study because of its implications for usability testing. The group of participants who performed concurrent thinking aloud were not given typical think-aloud instructions for a usability test. Instead, their instructions were typical of a think-aloud research study. Participants were told to "describe aloud what they are doing and thinking." They were not told to report any other internal experiences. In addition, they were never interrupted during a task. There was no probing. Any encouragement they needed to keep talking was only done between tasks. The retrospective participants were told that they would be watching the videotape of the session after the tasks and would be asked to think aloud then.

There were several interesting results. First, there were no differences between the concurrent and retrospective groups in task performance or in task difficulty ratings. The thinking aloud during the session did not cause the concurrent group to take more time to complete tasks, to complete fewer tasks, or to rate tasks as more difficult in comparison with the performance of the retrospective participants. These findings are consistent with the results from other think-aloud research studies, although in some studies thinking aloud does take longer.

The differences between the groups were in the types of statements the participants made when they thought out loud. The concurrent group verbalized about 4 times as many statements as the retrospective group, but the statements were almost all descriptions of what the participants were doing or reading from the screen. The participants who did concurrent thinking aloud were doing exactly as they were instructed; they were attending to the tasks and verbalizing a kind of "play-by-play" of what they were doing. The participants in the retrospective condition made only about one fourth as many statements while watching the tape, but many more of the statements were explanations of what they had been doing or comments on the user interface design. "The retrospective subjects... can give their full attention to the verbalizations and in doing so give richer information" (Bowers & Snyder, 1990, p. 1274).

This study shows us what would happen if we tried to get participants in usability tests to report only Level 1 verbalizations and did no probing of what they were doing and thinking. Their verbalizations would be much less informative. The study does show that retrospective thinking aloud yields more diagnostic verbalizations, but it takes 80% longer to have the participants do the tasks silently and then think out loud as they watch the tape.

There have been two fairly recent studies that compared a condition in which the tester could not hear the think aloud of participants with a condition in which they could (Lesaigle & Biers, 2000; Virzi, Sorce, & Herbert, 1993). In the Virzi et al. study, usability professionals who recorded usability problems from a videotape of test participants thinking aloud were compared with usability professionals who could see only the performance data of the test participants. Those who had only the performance data uncovered 46% of the problems with the product, whereas those seeing the think-aloud condition uncovered 69%. In the Lesaigle and Biers study, usability professionals who could see a video of only the screens the participants could see were compared with comparable professionals who could see the screens and hear the participants think aloud. The results showed that there were fewer problems uncovered in the screen-only condition compared with the screen plus think-aloud condition. Both of these studies suggest that in many cases participants' think-aloud protocols provide evidence of usability problems that do not otherwise show up in the data.

Conflict in Roles. Dumas (2001) explored how difficult it can be for test administrators to keep from encouraging or discouraging participants' positive or negative statements. He saw a conflict in two roles that administrators play: (a) the friendly facilitator of the test and (b) the neutral observer of the interaction between the participant and the product.

The friendly facilitator role and the neutral observer role come into conflict when participants make strong statements expressing an emotion such as, "I hate this program!" Almost anything the test administrator says at that point can influence whether the participants will report more or fewer of these negative feelings. Consider the following statements:


"Tell me more about that": relatively neutral in content but couldbe interpreted as encouraging more negative statements

"That's great feedback": again relatively neutral to someonewho has training in test administration but, I believe, soundsevasive to participants

"Those are the kinds of statements that really help us to under-stand how to improve the product": reinforcing the negative

"I really appreciate your effort to help us today": says nothingabout the content of what the participant said and is part ofplaying the friendly role with participants. Will the partici-pant hear it that way?

Silence: Neutral in content, but how will it be interpreted? In human interaction, one person's silence after another's strong statement is almost always interpreted as disagreement or disapproval. Without any other instructions, the participant is left to interpret the test administrator's silence: you don't care, you don't want negative comments, strong feelings are inappropriate in this kind of test, and so on.

Not all of the biasing responses to emotional statements are verbal. If the tester is in the room with the participant and takes notes when participants make an emotional statement, he or she may be reinforcing them to make more. Any of these responses could push test participants to utter more or fewer strong feelings. Dumas suggested that one way to avoid this conflict in roles is to tell participants what the two roles are in the pretest instructions.

The Special Case of Speech-Based Products. For the most part, the basic techniques of user testing apply to speech applications. There are a few areas, however, where testers may need to modify their methods (Dobroth, 1999):

• It is not possible for test participants to think aloud while they are using a speech recognition application because talking interferes with using the application. If participants were to speak aloud, the verbalized thoughts may be mistaken for input by the speech recognizer. Moreover, if participants think aloud while using a speech application, they may not be able to hear spoken prompts. One way to get around this problem is to have participants comment on the task immediately after finishing it. This works well for tasks that are short and uncomplicated. If tasks are longer, however, participants will begin to forget exactly what happened in the early parts of the task. In this case, the test administrator can make a recording of the participants' interaction with the system as they complete the task. At the end of the task, participants listen to the recording and stop it to comment on parts of the interaction that they found either clear or confusing. Both of these solutions provide useful information but add a substantial amount of time to test sessions.

• Evaluating speech-based interfaces often is complicated by the presence of a recognizer in the product. The recognizer interprets what the test participant says. Often the recognizer can't be changed; it is the software that surrounds it that is being tested. Using a poor recognizer often clouds the evaluation of the rest of the software. In a "Wizard-of-Oz" test (see chapter 52 for more on this technique), the test administrator creates the impression in participants that they are interacting with a voice response system. In reality, the flow and logic of each interaction is controlled by the test administrator, who interprets participants' responses to prompts, and responds with the next prompt in the interaction. Using this method also allows the administrator to be sure that error paths are tested. In a speech interface, much of the design skill is in dealing with the types of errors that recognizers often make.

• In the past, it was much more difficult to create a prototype of a speech-based product, but several options are now available, including the speech capabilities of office tools such as Microsoft's PowerPoint.

Selecting Tasks. One of the essential requirements of every usability test is that the test participants attempt tasks that users of the product will want to do. When a product of even modest complexity is tested, however, there are more tasks than there is time available to test them. Hence the need to select a sample of tasks. Although not often recognized as a liability of testing, the sample of tasks is a limitation to the scope of a test. Components of a design that are not touched by the tasks the participants perform are not evaluated. This limitation in thoroughness is often why testing is combined with usability inspection methods, which have thoroughness as one of their strengths.

In a diagnostic test, testers select tasks for several reasons:

• They include important tasks, that is, tasks that are performed frequently or are basic to the job users will want to accomplish, and tasks, such as log in or installation, that are critical, if infrequent, because they affect other tasks. With almost any product there is a set of basic tasks. Basic means tasks that tap into the core functionality of the product. For example, a nurse using a patient monitor will frequently look to see the vital sign values of the patient and will want to silence any alarms once she or he determines the cause. In addition, the nurse will want to adjust the alarm limits, even though the limit adjustment may be done infrequently. Consequently, viewing vital signs, silencing alarms, and adjusting alarm limits are basic tasks.

• They include tasks that probe areas where usability problems are likely. For example, if testers think that users will have difficulty knowing when to save their work, they may add saving work to several other tasks. Selecting these kinds of tasks makes it more likely that usability problems will be uncovered by the test, an important goal of a diagnostic test. But including these kinds of tasks makes it likely that a diagnostic test will uncover additional usability problems. In effect, these tasks pose a more difficult challenge to a product than if just commonly done or critical tasks are included. These tasks can make a product look less usable than if they were not included. As we will see below, this is one of the reasons why a diagnostic test does not provide an accurate measure of a product's usability.

• They include tasks that probe the components of a design. For example, tasks that force the user to navigate to the lowest level of the menus or tasks that have toolbar shortcuts. The goal is to include tasks that increase thoroughness at uncovering problems. When testing other components of a product, such as a print manual, testers may include tasks that focus on what is in the manual, such as a task that just asks the participant to locate a number of items (Branaghan, 1998).

Some additional reasons for selecting tasks are:

• They may be easy to do because they have been redesigned in response to the results of a previous test.

• They may be new to the product line, such as sending an order for a drug to the hospital pharmacy.

• They may cause interference from old habits, such as a task that has been changed from a previous release of the product.

With so many reasons for selecting tasks, paring the task list to the time available is an important part of test planning. Typically, testers and developers get together in the early stages of test planning to create a task list. In addition to including tasks in the list, the testers need to make some preliminary estimate of how long each task will take. The time estimate is important for deciding how many tasks to include, and it may also be useful for setting time limits for each task. Even in a diagnostic test, time limits are useful because testers want participants to get through most of the tasks. Setting time limits is always a bit of a guess. Until you conduct a pilot test, it is difficult to make accurate estimates of time limits, but some estimate is necessary for planning purposes.
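As a small illustration of paring a prioritized task list to the available time, the sketch below trims the list so that the summed time limits fit within a session; the task names, minute values, session length, and overhead allowance are invented examples, not recommendations.

```python
# Trim a candidate task list so the summed time limits fit the session,
# keeping tasks in priority order (all numbers are invented examples).
tasks = [
    ("View vital signs", 5),        # (task, estimated time limit in minutes)
    ("Silence an alarm", 5),
    ("Adjust alarm limits", 10),
    ("Install the product", 20),
    ("Send an order to the pharmacy", 15),
]
session_minutes = 90
overhead = 20  # pretest instructions, questionnaires, debriefing

selected, remaining = [], session_minutes - overhead
for name, limit in tasks:
    if limit <= remaining:
        selected.append(name)
        remaining -= limit
print(selected, f"{remaining} min to spare")
```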

The Tasks Are Presented in Task Scenarios. Almost without exception, testers present the tasks that the participants do in the form of a task scenario. For example:

You've just bought a new combination telephone and answering machine. It is in the box on the table. Take the product out of the box and set it up so that you can make and receive calls.

A good scenario is short, in the user's words not the product's, unambiguous, and gives participants enough information to do the task. It never tells the participant how to do the task.

From the beginning, usability testers recognized the artificiality of the testing environment. The task scenario is an attempt to bring a flavor of the way the product will be used into the test. In most cases, the scenario is the only mechanism for introducing the operational environment into the test situation. Rubin (1994, p. 125) describes task scenarios as adding context and the participant's rationale and motivation to perform tasks. "The context of the scenarios will also help them to evaluate elements in your product's design that simply do not jibe with reality" and "The closer that the scenarios represent reality, the more reliable the test results" (emphasis added). Dumas and Redish (1999, p. 174) said, "The whole point of usability testing is to predict what will happen when people use the product on their own.... The participants should feel as if the scenario matches what they would have to do and what they would know when they are doing that task in their actual jobs" (emphasis added).

During test planning, testers work on the wording of each scenario. The scenario needs to be carefully worded so as not to mislead the participant to try to perform a different task. Testers also try to avoid using terms in the scenario that give the participants clues about how to perform the task, such as using the name of a menu option in the scenario.

In addition to the wording of the task scenarios, their order may also be important. It is common for scenarios to have dependencies. For example, in testing a cellular phone there may be a task to enter a phone number into memory and a later task to change it. A problem with dependencies happens when the participant can't complete the first task. Testers have developed strategies to handle this situation, such as putting a phone number in another memory location that the test administrator can direct the participants to when they could not complete the earlier task.

Testers continue to believe in the importance of scenarios and always use them. There is no research, however, showing that describing tasks as scenarios rather than simple task statements makes any difference to the performance or subjective judgments of participants. But taking note of the product's use environment may be important, as described in the next section.

The Participants Are Observed, and Data Are Recorded and Analyzed

Capturing Data As They Occur. Recording data during the session remains a challenge. All agree that testers need to plan how they will record what occurs. There are too many events happening too quickly to be able to record them in free-form notes. The goal is to record key events while they happen rather than having to take valuable time to watch videotapes later. There are three ways that testers deal with the complexity of recording data:

• Create data collection forms for events that can be anticipated (Kantner, 2001a).

• Create or purchase data logging software (Philips & Dumas, 1990).

• Automatically capture participant actions in log files (Kantner, 2001b) or with specialized software (Lister, 2001).

Figure 56.2 shows a sample data collection form for the task of saving a file to a diskette in Microsoft Windows. Notice that it is set up to capture both paths to success and paths to failure. The form also allows for capturing a Reject, which is a task that a participant considers complete but the data collector knows is not. Rejects are important to note because, although they are failures, they often have task times that are faster than even successful tasks.

The use of data logging software continues at many of the larger testing facilities. Most have created their own logging software. Logging test activities in real time, however, continues to be a messy process. Testers almost always have to edit the data log after each session to remove errors and misunderstandings, such as when a task was really over.

It is difficult to use forms or software when the administrator is sitting in the test room beside the participant and is conducting the test alone. Without a data recorder, it is difficult, but still possible, to sit in the test room with the test participant and to record data at the same time.
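For testers who build their own logging tools, the sketch below shows one minimal way to represent a logged task attempt, using the Pass, Fail, and Reject outcomes described above; the field names and helper methods are hypothetical and are not taken from any particular logging package.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TaskLogEntry:
    """One logged task attempt; Reject means the participant believed the
    task was complete but the data collector knows it was not."""
    task: str
    outcome: str = "Pass"           # "Pass", "Fail", or "Reject"
    path: str = ""                  # e.g., a note on which path the participant took
    start: float = field(default_factory=time.time)
    end: float = 0.0
    notes: list = field(default_factory=list)

    def finish(self, outcome, path=""):
        self.outcome, self.path, self.end = outcome, path, time.time()

    def task_time(self):
        return self.end - self.start

# Usage during a session: open an entry when the scenario is read aloud,
# append observations as they happen, close it when the task ends.
entry = TaskLogEntry("Copy a Word file to a diskette")
entry.notes.append("Opened Explorer, hesitated over the toolbar")
entry.finish("Reject", path="Chose Save in Word")
print(entry.outcome, round(entry.task_time(), 1), "seconds")
```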


[FIGURE 56.2. Sample data collection form. The form covers Task 1, copying a Word file to a diskette, with spaces to record the task time and a Pass, Fail, or Reject outcome, and checklists of the anticipated paths through Explorer, My Documents, Word, and Help (e.g., dragging the file between Explorer panes, using File: Send to: Floppy A, or copying and pasting with the toolbar, Edit menu, or keyboard).]

Collecting data is a special challenge with Web-based products. There are so many links and controls on a typical Web page that it is difficult to record what is happening short of watching the session again on videotape. This difficulty has renewed interest in automatic data collection. But the tools to do this capture usually record data that are at too low a level to uncover usability problems. Most usability problems don't need to be diagnosed at the mouse click or key press level.

Getting Developers and Managers to Watch Test Sessions. One of the important assets of testing is that it sells itself. Watching even a few minutes of live testing can be very persuasive. There are two reasons why testers need to get key project staff and decision makers to come to watch a test session:

• When people see their first live test session, they are almost always fascinated by what they see. They gain understanding of the value of the method. Watching a videotape of a session does not provide the same experience. Expend whatever effort it takes to get these people to attend test sessions.

• When developers see live sessions, it is much easier to communicate the results to them. When they have seen some of the usability problems themselves, they are much less likely to resist agreeing on what the most important problems are. Some of them will even become advocates for testing.

Even though testing is known and accepted by a much wider circle of people than it was 10 years ago, the experience of watching a user work at a task while thinking aloud still converts more people to accept usability practices than any other development tool.

Two Participants Work Together. Most usability tests are run with a single test participant. Studies show that when two participants work together, sometimes called the codiscovery method (Kennedy, 1989), they make more utterances. The nature of the utterances also is different, with codiscovery participants making more evaluative, as opposed to descriptive, statements and making more statements that developers view as useful (Hackman & Biers, 1992). But using codiscovery does require recruiting twice as many participants. A related method is to have one participant teach another how to do a task (Vora, 1994).

The Usability Lab Is Now Ubiquitous. Usability labs continue to be built, and there is a brisk business in selling lab equipment. The demand for labs is driven by the advantages of having recording equipment and the ability to allow stakeholders to view the test sessions. In essence, the method sells itself in the sense that developers and managers find compelling the experience of watching a live test session. A testing facility, especially one with a one-way mirror, adds a sense of scientific credibility to testing, which, as we will discuss below, may be a false sense.

The basic makeup of a suite of usability test equipment has not changed much with time. It consists of video and audio recording equipment and video mixing equipment. For testing products that run on general-purpose computer equipment, a common setup is a scan converter showing what is on the test participant's screen and a video camera focused on the face or head and shoulders of the participant.

There are some recent innovations in lab equipment that are enhancing measurement. Miniaturization continues to shrink the size of almost all lab equipment, hence the arrival of portable lab setups that fit in airplane overhead compartments. Relatively inexpensive eye-tracking equipment has made it possible to know where participants are looking as they work.

The quality of video images recorded during sessions has always been poor. Second-generation copies, which are often used in highlight tapes, make the quality of highlight tapes even poorer. Scan converters selling for under $2,000 produce surprisingly poor images, making it difficult to see screen details.

The move to digital video and inexpensive writeable CDs promises to improve recordings and to make it substantially easier to find and edit video segments.

Mimicking the Operational Environment. Testers often make changes to the setup of the test room. Rubin (1994, p. 95) describes the requirements for the testing environment as follows: "Make the testing environment as realistic as possible. As much as possible, try to maintain a testing environment that mimics the actual working environment in which the product will be used." But is putting a couch in the test room to make it look more like a room in a home simulating the use environment? It may not be, but going to the participant's home for testing is a complex process (Mitropoulos-Rundus & Muzak, 1997).

The literature on product evaluation, when viewed from the perspective of 50 years, shows that in complex operational environments, researchers and practitioners have used software simulations or hardware-software simulators to mimic that operational environment. For example, aircraft and automobile simulators are used to study interactions with cockpits and dashboards as well as for operator training. More recently, hospital operating-room simulators have been developed to study equipment interaction issues in anesthesiology (Gaba, 1994).

A variable, usually called fidelity, is used to describe the degree to which simulations or simulators mimic the operational environment. In those interactions between users and aircraft, automobiles, and operating rooms, the environment is so important that simulations are needed to mimic it. There may be other environments that influence the usability of the products we test, and we need to think more about the fidelity of our testing environments (Wichansky, 2000). I discuss this issue further in the section Challenges to the Validity of Usability Testing.

The Impact of the Testing Equipment. An issue that has been debated throughout the history of usability testing is the impact of one-way mirrors and recording equipment on the test participants. This debate comes to a head in discussions about whether the test administrator should sit with participants as they work or stay behind the one-way mirror and talk over an intercom. Some testing groups always sit with the participant, believing that it reduces the participants' anxiety about being in the test and makes it easier to manage the session (Rubin, 1994). Other testing groups normally do not sit with the participants, believing that it makes it easier to remain objective and frees the administrator to record the actions of the participants (Dumas & Redish, 1999). There is one study that partially addressed this issue. Barker and Biers (1994) conducted an experiment in which they varied whether there was a one-way mirror and cameras in the test room. They found that the presence of the equipment did not affect the participants' performance or ratings of usability of the product.

Remote Testing. Remote usability testing refers to situations in which the test administrator and the test participant are not at the same location (Hartson, Castillo, Kelso, Kamler, & Neale, 1996). This can happen for a number of reasons, such as testing products used by only a few users who are spread throughout the country or the world. Products such as NetMeeting software make it possible for the tester to see what is on the participant's screen and, with a phone connection, hear the participants think aloud.

There are other technologies that can provide testers with even more information, but they often require both parties to have special video cards and software. There are also technologies for having ratings or preference questions pop up while participants are working remotely (Abelow, 1992). But Lesaigle and Biers' (2000) study showed that uncovering problems only through participants' questionnaire data had the least overlap with conditions in which testers uncovered problems by watching participants work or by seeing their screens as they worked. They concluded that "the present authors are skeptical about using feedback provided by the user through online questionnaires as the sole source of information" (p. 587). Still, no one would disagree that some remote testing is better than no testing at all.

The primary advantages of remote testing include the following:

• Participants are tested in an environment that is comfortable and familiar to them.

• Participants are tested using their own equipment and environment.

• Test costs are reduced because participants are easier to recruit, do not have to travel, and often do not have to be compensated. In addition, there are no test facility costs.

But there can be disadvantages to remote testing:

• With live testing, the viewer or conferencing software can slow down the product being tested.

• Company firewalls can prevent live testing. Most viewer and meeting software cannot be used if there is a firewall.


Perkins (2001) described a range of remote usability testing options that includes user-reported critical incidents, embedded survey questions, live remote testing with a viewer, and live remote testing with conferencing software.

Measures and Data Analysis

In this section, I discuss test measures, discrepancies between measures, and data analysis.

Test Measures. There are several ways to categorize the measures taken in a usability test. One is to break them into two groups: (a) performance measures, such as task time and task completion, and (b) subjective measures, such as ratings of usability or participants' comments. Another common breakdown uses three categories: (a) efficiency measures (primarily task time), (b) effectiveness measures (such as task success), and (c) satisfaction measures (such as rating scales and preferences).

Most performance measures involve time or simple counts of events. The most common time measure is time to complete each task. Other time measures include the time to reach intermediate goals, such as the time to find an item in Help. The counts of events, in addition to task completion, include the number of various types of errors, especially repeated errors, and the number of assists. An assist happens when the test administrator decides that the participant is not making progress toward task completion but that more can be learned about the product by keeping the participant working on the task, and so gives the participant just enough help to continue. An assist is important because it indicates that there is a usability problem that will keep participants from completing a task. The way assists are given to participants by test administrators, part of the art of running a test, is not consistent from one usability testing organization to another (Boren & Ramey, 2000).

There are some complex measures that are not often used in diagnostic tests but are sometimes used in comparison tests. These measures include the time the participant works toward the task goal divided by the total task time (sometimes called task efficiency) and the task time for a participant divided by the average time for some referent person or group, such as an expert or an average user.

It seems only natural that an important measure of the usability of a product should be the test participants' opinions and judgments about the ease or difficulty of using it. The end of the test session is a good time to ask for those opinions. The participant has spent an hour or two using the product and probably has as much experience with it as he or she is likely to have. Consequently, a posttest interview or a brief questionnaire is a common subjective measure (see, however, the discussion about the discrepancies between measures in the next subsection).
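As a rough illustration of how these measures might be summarized for a single task, the sketch below computes the completion rate, mean task time, the task-efficiency ratio, the ratio to an expert's time, and the mean ease rating. All of the numbers, including the expert reference time, are invented for the example.

```python
from statistics import mean

# Each tuple: (completed task?, total time in s, productive time in s, ease rating 1-7)
results = [
    (True, 240, 200, 6),
    (True, 310, 220, 5),
    (False, 600, 350, 3),
    (True, 180, 170, 7),
    (True, 420, 260, 4),
]
expert_time = 90  # assumed reference time for the same task

completion_rate = sum(done for done, *_ in results) / len(results)
mean_time = mean(t for _, t, _, _ in results)
task_efficiency = mean(productive / total for _, total, productive, _ in results)
relative_time = mean(t / expert_time for _, t, _, _ in results)
mean_rating = mean(r for *_, r in results)

print(f"Completion rate:  {completion_rate:.0%}")
print(f"Mean task time:   {mean_time:.0f} s")
print(f"Task efficiency:  {task_efficiency:.2f}")   # productive time / total time
print(f"Time vs. expert:  {relative_time:.1f}x")    # participant time / expert time
print(f"Mean ease rating: {mean_rating:.1f} / 7")
```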

Eye tracking is a relatively new measure in user testing. Eye-tracking equipment has come down in price in recent years. You can purchase a system for about $40,000, plus or minus $10,000, depending on accessories and data reduction software. The new systems are head mounted and allow the test participant a good deal of movement without losing track of where the eye is looking.

Interest in where participants are looking has increased with the proliferation of Web software. There are so many links and controls on a Web page that it can be difficult to know exactly where participants are looking. An eye tracker helps solve that problem. Not all test participants can be calibrated on an eye tracker; up to 20 percent of typical user populations cannot be calibrated because of eye abnormalities.

The data from the tracker are broken into fixations: 300-millisecond periods during which the point of regard doesn't move more than 1° of visual angle. Each fixation has a start time, duration, point-of-gaze coordinates, and average pupil diameter. Eye movement analysis involves looking at fixations within an area of interest (AOI), which is a tester-defined area on a screen. These regions usually define some object or control on a page. There are AOIs on each page, and testers use eye-tracking software to compute statistics such as the average amount of time and the number and duration of fixations in each AOI. Then there are statistics that measure eye movements from one AOI to another and plots of scan paths.
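The sketch below illustrates the basic AOI bookkeeping: each fixation is assigned to the rectangle it falls in, and counts and dwell times are accumulated per AOI. The fixation records and AOI rectangles are invented for the example; commercial eye-tracking packages export richer data and compute these statistics themselves.

```python
from collections import defaultdict

# Each fixation: (start ms, duration ms, x, y) in screen coordinates (invented data).
fixations = [
    (0, 320, 110, 85),
    (320, 450, 530, 90),
    (770, 280, 525, 95),
    (1050, 600, 300, 400),
]

# AOIs as named rectangles: (x_min, y_min, x_max, y_max) -- placeholder page regions.
aois = {
    "nav_menu": (0, 0, 200, 150),
    "search_box": (480, 60, 620, 120),
}

def aoi_for(x, y):
    """Return the name of the AOI containing the point, or None if it is outside all AOIs."""
    for name, (x0, y0, x1, y1) in aois.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

stats = defaultdict(lambda: {"fixations": 0, "dwell_ms": 0})
for _start, duration, x, y in fixations:
    name = aoi_for(x, y)
    if name is not None:
        stats[name]["fixations"] += 1
        stats[name]["dwell_ms"] += duration

for name, s in stats.items():
    print(f"{name}: {s['fixations']} fixations, {s['dwell_ms']} ms total dwell time")
```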

Eye-tracking systems produce a great deal of data; a 60-Hz tracker produces 3,600 records a minute. Consequently, data reduction becomes a major task, and eye tracking isn't something that is used without a specific need. Goldberg (2000) identified evaluation criteria that can benefit most from eye-tracking data, with visual clarity getting the most benefit. But there are other evaluation areas that are likely to benefit as tracking technology becomes cheaper and easier to manage.

Discrepancies Between Measures. Some investigators find only a weak correlation between efficiency measures and effectiveness measures. Frokjaer, Hertzum, and Hornbaek (2000) described a study in which they found such a weak correlation. They then went back and looked at several years of usability test reports in the proceedings of the annual CHI conference. They noted that it is common for testers to report only one category of performance measure, and they cautioned testers not to expect different types of measures to be related.

There is a vocal minority of people writing about usability testing measures who argue against the use of quantitative performance measures in favor of a qualitative analysis of test data. Hughes (1999) argued that qualitative measures can be just as reliable and valid as quantitative measures.

A common finding in the literature is that performance measures and subjective measures are often weakly correlated. Lesaigle and Biers (2000) compared how well testers uncovered usability problems under a number of conditions:

• They can see only the screen the participant sees.
• They can see the screen and hear the participants think aloud.
• They can see the screens, hear the think-aloud, and see the participant's face.
• They see only the responses to questionnaire items.

The results show that uncovering problems only through participants' questionnaire data had the least overlap with the other three conditions. The authors concluded that "questionnaire data taps a somewhat different problem set," and "the questionnaire data was less likely to reveal the most severe problems" (p. 587). Bailey (1993) and Ground and Ensing (1999) both reported cases in which participants perform better with products that they don't prefer and vice versa. Bailey recommended using only performance measures and not using subjective measures when there is a choice.

One of the difficulties with test questions is that they are influenced by factors outside of the experience that participants have during the test session. There are at least three sources of distortions or errors in survey or interview data: (a) the characteristics of the participants, (b) the characteristics of the interviewer or the way the interviewer interacts with the participant, and (c) the characteristics of the task situation itself. Task-based distortions include such factors as the format of questions and answers, how participants interpret the questions, and how sensitive or threatening the questions are (Bradburn, 1983). In general, the characteristics of the task situation produce larger distortions than the characteristics of the interviewer or the participant. Orne (1969) called these task characteristics the "demand characteristics of the situation." (See Dumas, 1998b, 1998c, for a discussion of these issues in a usability testing context.) In addition to the demand characteristics, subjective measures can be distorted by events in the test, such as one key event, especially one that occurs late in the session.

Creating closed-ended questions or rating scales that probe what the tester is interested in is one of the most difficult challenges in usability test methodology. Test administrators seldom have any training in question development or interpretation. Unfortunately, measuring subjective states is not a knowledge area where testers' intuition is enough. It is difficult to create valid questions, that is, questions that measure what we want to measure. Testers without training in question development can use open-ended questions and consider questions as an opportunity to stimulate participants to talk about their opinions and preferences.

Testers often talk about the common finding that the way participants rate a product is at odds with the way the testers themselves would rate its usability. There are several explanations for why participants might say they liked a product that, in the testers' eyes, was difficult to use. Most explanations point to a number of factors that all push user ratings toward the positive end of the scale. Some of the factors have to do with the demand characteristics of the testing situation, for example, participants' need to be viewed as positive rather than negative people or their desire to please the test administrator. Other factors include the tendency of participants to blame themselves rather than the product and the influence of one positive experience during the test, especially when it occurs late in the session.

Test participants continue to blame themselves for problems that usability specialists would blame on the user interface. This tendency seems to be a deep-seated cultural phenomenon that doesn't go away just because a test administrator tells the participant during the pretest instructions that the session is not a test of the participants' knowledge or ability. These positive ratings and comments from participants often put testers in a situation in which they feel they have to explain away participants' positive judgments of the product. Testers always feel that the performance measures are true indicators of usability, whereas subjective statements are unreliable. For example, a very long task time or a failure to complete a task is a true measure of usability, whereas a positive rating of six out of seven on usability is inflated by demand characteristics.

Data Analysis. Triangulation of measures is critical. It is rare that a usability problem affects only one measure. For example, a poorly constructed icon toolbar will generate errors (especially picking the wrong icon on the toolbar), slow task times (during which participants hesitate over each icon and frequently click through them looking for the one they want), and statements of frustration (participants express their feelings about not being able to learn how the icons are organized or to guess what an icon will do from the tool tip).

Much of the data analysis involves building a case for a usability problem by combining several measures, a process that is called triangulation (Dumas & Redish, 1999). The case building is driven by the problem list created during the test sessions. It is surprising how much of this analysis is dependent on the think-aloud protocol. We depend on what participants say to help us understand what the problem is.
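A crude sketch of the triangulation idea appears below: a task is flagged as a likely problem area only when several independent measures point the same way. The thresholds and the data are assumptions chosen for illustration, not recommended criteria.

```python
# Invented per-task summaries: error count, observed time divided by the time goal,
# and the number of negative think-aloud comments recorded for the task.
tasks = {
    "toolbar_task": {"errors": 4, "time_vs_goal": 2.3, "negative_comments": 5},
    "save_task":    {"errors": 0, "time_vs_goal": 0.9, "negative_comments": 1},
}

def converging_evidence(m, min_signals=2):
    """Count how many of the three measures look bad and require at least min_signals."""
    signals = [
        m["errors"] >= 2,             # repeated errors
        m["time_vs_goal"] > 1.5,      # well over the time goal
        m["negative_comments"] >= 3,  # frustration in the think-aloud protocol
    ]
    return sum(signals) >= min_signals

for name, measures in tasks.items():
    verdict = "likely usability problem" if converging_evidence(measures) else "no converging evidence"
    print(f"{name}: {verdict}")
```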

Identifying usability problems is key. Most usability problems do not emerge from the analysis of the data after the test. The problems are observed during the sessions and are recorded on problem sheets or data logs. Later, the problem sheet or log drives the data analysis. The problem sheet is usually created by the test administrator during the test sessions or immediately afterward. The sheet is organized by participant and by task. What gets recorded on the sheet are observations, such as "didn't see the option," and interpretations, such as "doesn't understand the graphic." When the same problem appears again, it is noted.

Experienced usability testers see the basic causes of problems. From individual instances of problems, the experienced tester sees patterns that point to more general problems. For example, a tester might see instances of participants spending time looking around the screen and aimlessly looking through menu options and conclude that "the participants were overwhelmed with the amount of information on the screen." From a number of instances of participants not understanding terms, the tester might conclude "the interface has too much technical and computer jargon." From a number of instances of participants doing a task twice to make sure it was completed, the tester might conclude that "there is not enough feedback about what the system is doing with the participant's actions." Seeing the underlying causes of individual problem tokens is one of the important skills that a usability tester develops. It is not entirely clear that such skills can be taught quickly. Testers often have years of experience studying and practicing problem identification skills. But do experienced testers see the same problems and causes? As I discuss later, there is some doubt about the consistency of problem labeling.

While watching a test session, a product developer will see the same events or tokens as the test administrator. But developers tend to see all problems as local. Instead of seeing that there needs to be a general review of the language in the interface, the developer sees problems with individual words. This conflict often doesn't appear until the testers and developers sit down to discuss what they saw and what to do about it. Usability professionals believe that this conflict over what "really" happened during the test remains a major barrier to improving a product's usability. Handling this conflict takes some diplomacy. Developers don't like to be told that they have tunnel vision and can't see the underlying causes of individual tokens, and usability professionals don't like hearing that the local fix will solve the problem. This conflict continues to limit the impact of testing on product improvement.

There have been several research studies that have looked at how many usability problems are uncovered by different populations. These studies consistently show that usability specialists find more problems than product developers or computer scientists. But all of the studies have used inspection evaluation methods, not user-based evaluation methods.

One of the issues still being debated about usability problems is whether to place them into a company's software bug tracking system (Wilson & Coyne, 2001). Putting them into the system can be effective if the bugs are more likely to be fixed. But fitting the bugs into a bug severity rating scale often is difficult, and there is always a risk that the fix will solve only the local impact of the problem, not its basic structural cause. Some bug tracking systems require that a bug be assigned only one cause, which would not adequately describe many usability problems.

One way to call attention to important problems is to put them into a measurement tool such as a problem severity scale. These scales determine which problems are the most severe and, presumably, more likely to be candidates to be fixed. There have been several recent research studies that have looked at the validity and the reliability of these scales.

A disappointing aspect of this research is the lack of consistency in severity judgments. This lack appears in all forms of usability evaluation, inspection and user-based, and is one of the most important challenges to usability methodology. Several practitioners have proposed severity rating schemes: Nielsen (1992), Dumas and Redish (1999), Rubin (1994), and Wilson and Coyne (2001). The schemes have three properties in common:

1. They all use a rating scale that is derived from software bug reporting scales. The most severe category usually involves loss of data or task failure, and the least severe category involves problems that are so unimportant that they don't need an immediate fix. All of the authors assume that the measurement level of their scale is at least ordinal, that is, the problems get worse as the scale value increases. The middle levels between the extremes are usually difficult to interpret and are stated in words that are hard to apply to specific cases. For example, Dumas and Redish proposed two middle levels: (a) problems that create significant delay and frustration and (b) problems that have a minor effect on usability. Nielsen's middle levels are (a) major usability problem (important to fix and so should be given high priority) and (b) minor usability problem (fixing is given low priority). Practitioners are not given any guidance on how problems fit into the scale levels, especially the middle ones.

2. All of the authors admit, at least indirectly, that their scales alone are not enough to assess severity. The authors propose one or more additional factors for the tester to consider in judging severity. For example, Nielsen (1992) described four factors in addition to the severity rating itself: frequency, impact, persistence, and something called "market impact." Rubin (1994) proposed multiplying the rating by the number of users who have the problem (a rough sketch of this multiplication rule appears after this list). Dumas and Redish (1999) added a second dimension: the scope of the problem from local to global, with no levels in between. With the exception of Rubin's multiplication rule, none of these other factors are described in enough detail to indicate how their combination with the severity scale would work, which is, perhaps, an indicator of the weakness of the severity scales themselves.

3. None of the scales indicate how to treat individual differences. For example, what does one do if only two of eight participants cannot complete a task because of a usability problem? Is that problem in the most severe category, or does it move down a level? If a problem is global rather than local, does that change its severity? The authors of these scales provide little guidance.
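The sketch below illustrates the kind of prioritization Rubin's multiplication rule produces: the severity rating is multiplied by the number of participants affected, and problems are ranked by the product. The problems, ratings, and counts are invented for the example.

```python
# Each entry: (description, severity rating 1-4 with 4 most severe, participants affected)
problems = [
    ("cannot find the Save As command", 4, 7),
    ("toolbar icon labels misunderstood", 3, 5),
    ("typo in a dialog message", 1, 8),
]

# Rank by severity x frequency rather than by the raw severity rating alone.
ranked = sorted(problems, key=lambda p: p[1] * p[2], reverse=True)

for description, severity, affected in ranked:
    print(f"priority {severity * affected:>2}: {description} "
          f"(severity {severity}, {affected} participants)")
```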

There have been a number of research studies investigating the consistency of severity ratings. These studies all show that the degree of consistency is not encouraging. Most studies have looked at the inconsistencies among experts using severity scales with inspection methods such as heuristic evaluation. But Jacobsen and John (1998) showed that it also applies to usability testing. They asked four experienced usability testers to watch tapes of the same usability test and then identify problems, including the top-10 problems in terms of severity. Of the 93 problems identified with the product, only 20% were detected by all evaluators, whereas 46% were only found by a single evaluator. None of the top-10 severe problems appeared on all four evaluators' lists.

Lesaigle and Biers (2000) reported a disappointing correlation coefficient (0.16) among professional testers' ratings of the severity of the same usability problems in a usability test. They used Nielsen's severity rating scale. Catani and Biers (1998) found that heuristic evaluation and user testing did not uncover the same problems, and that severity ratings of usability professionals did not agree with each other.

The results of these studies cast doubt on one of the most often-mentioned assets of usability testing: its touted ability to uncover the most severe usability problems.

Communicating Test Results

In the early days of user testing, there almost always was a formal test report. Testers needed reports to communicate what they did, what they found, and what testing was all about. Now it is more common for the results of a test to be communicated more informally, such as at a meeting held soon after the last test session. Communication at these meetings is facilitated when the product team has attended at least some of the test sessions.

One of the important reasons for the change in reporting style for diagnostic usability tests is the confidence organizations have in the user testing process. It now is less often necessary to write a report to justify conducting the test. Organizations with active usability programs have come to accept user testing as a valid and useful evaluation tool. They don't feel that they need to know the details of the test method and the data analysis procedures. They want to know the bottom line: What problems surfaced, and what should they do about them? In these organizations, a written report may still have value but as a means of documenting the test.

The Value of Highlight Tapes. A highlight tape is a short, visual illustration of the 4 or 5 most important results of a test. In the early days of testing, almost every test had a highlight tape, especially a tape aimed at important decision makers who could not attend the sessions. These tapes had two purposes: to show what happened during the test in an interesting way and to illustrate what a usability test is and what it can reveal.

As usability testing has become an accepted evaluation tool, the second purpose for highlight tapes has become less necessary. One of the disappointing aspects of highlight tapes is that watching them does not have the same impact as seeing the sessions live. Unless the action moves quickly, even highlight tapes can be boring. This characteristic makes careful editing of the highlights a must. But if the editing system is not digital, it takes about 1 hour to create 1 minute of finished tape. A 15-minute tape can take 2 days to create, even by an experienced editor. Most of that time is spent finding appropriate segments to illustrate key findings. The emergence of digital video will make highlight tapes less time-consuming.

Some testers use the capabilities of tools such as PowerPoint to put selections from a videotape next to a bullet in a slide presentation rather than having a separate highlight tape. Others have begun to store and replay video in a different way. There are video cards for personal computers that will take a feed from a camera and store images in MPEG format on a compact disk (CD). Each CD stores about an hour of taping. A tester can then show an audience the highlights by showing segments of the CDs in sequence, thus eliminating the need for editing. Because the cost of blank CDs is only about a dollar, they are cheaper to buy than videotapes and take up less storage space.

VARIATIONS ON THE ESSENTIALS

In this section, I discuss aspects of usability testing that go beyond the basics of a simple diagnostic test. The section includes measuring and comparing usability, baseline usability tests, and allowing free exploration.

Measuring and Comparing Usability

A diagnostic usability test is not intended to measure usability as much as to uncover as many usability problems as it can. It doesn't directly answer the question, "How usable is this product?" It would be wonderful to be able to answer that question with a precise, absolute statement such as, "It's very usable" or, better, "It's 85% usable." But there is no absolute measure of usability, and without a comparative yardstick it is difficult to pinpoint a product's usability.

It would be ideal if we could say that a product is usable if participants complete 80% of their tasks and if they give it an average ease-of-use rating of 5.5 out of 7, with 7 being very usable. But all tasks and tests are not equal. One of the limiting factors in measuring usability is the makeup of the diagnostic test itself. It typically tests a very small sample of participants; it encourages those participants to take the time to think aloud and to make useful verbal diversions as they work; it allows the test administrator the freedom to probe interesting issues and to take such actions as skipping tasks that won't be informative; and it deals with a product that might be in prototype form and, consequently, will occasionally malfunction. Those qualities make a diagnostic test good at exploring problems, but limited at measuring usability.

Historically, human factors professionals have made a distinction between formative and summative measurement. A formative test is done early in development to contribute to a product's design; a summative test is performed late in development to evaluate the design. A diagnostic test is clearly a formative test. But what specifically is a summative usability test?

At the present time, without a comparison product, we are left with the judgment of a usability specialist about how usable a product is, based on their interpretation of a summative usability test. Experienced usability professionals believe that they can make a relatively accurate and reliable assessment of a product's usability given data from a test designed to measure usability, that is, a test with a stable product and a larger sample than is typical, and one in which participants are discouraged from making verbal diversions and the administrator makes minimal interruptions to the flow of tasks. This expert judgment is the basis of the Common Industry Format (CIF) I describe below. Perhaps someday, we will be able to make more precise measurements based directly on measures that are not filtered through the judgment of a usability professional. In the meantime, those judgments are the best estimate we have.

Comparing the Usability of Products

An important variation on the purpose of a usability test is one that focuses primarily on comparing usability. Here the intention is to measure how usable a product is relative to some other product or to an earlier version of itself. There are two types of comparison tests: (a) an internal usability test focused on finding as much as possible about a product's usability relative to a comparison product (a comparative test or a diagnostic, comparative test) and (b) a test intended to produce results that will be used to measure comparative usability or to promote the winner over the others (a competitive usability test).

In both types of comparison tests, there are two important considerations: (a) the test design must provide a valid comparison between the products, and (b) the selection of test participants, the tasks, and the way the test administrator interacts with participants must not favor any of the products.

Designing Comparison Tests. As soon as the purpose of the test moves from diagnosis to comparison measurement, the test design moves toward becoming more like a research design. To demonstrate that one product is better on some measure, you need a design that will validly measure the comparison. The design issues usually focus on two questions:

• Will each participant use all of the products, some of the products, or only one product?

• How many participants are enough to detect a statistically significant difference?

In the research methods literature, a design in which participants use all of the products is called a "within-subjects" design, whereas in a "between-subjects" design each participant uses only one product. If testers use a between-subjects design, they avoid having any contamination from product to product, but they need to make sure that the groups who use each product are equivalent in important ways. For example, in a typical between-subjects design, members of one group are recruited because they have experience with Product A, whereas a second group is recruited because they have experience with Product B. Each group then uses the product they know. But the two groups need to have equivalent levels of experience with the product they use. They also need to be equivalent on related variables, such as job titles and time worked, general computer literacy, and so on.

Because it is difficult to match groups on all of the relevant variables, between-subjects designs need to have enough participants in each group to wash out any minor differences. An important concern to beware of in the between-subjects design is the situation in which one of the participants in a group is especially good or bad at performing tasks. Gray and Salzman (1998) called this the "wildcard effect." If the group sizes are small, one superstar or dud could dramatically affect the comparison. With larger numbers of participants in a group, the wildcard has a smaller impact on the overall results. This phenomenon is one of the reasons that competitive tests have larger sample sizes than diagnostic tests. The exact number of participants depends on the design and the variability in the data. Sample sizes in competitive tests are closer to 20 in a group than the 5 to 8 that is common in diagnostic tests.

If testers use a within-subjects design in which each participant uses all of the products, they eliminate the effect of groups not being equivalent but then have to worry about other problems, the most important of which are order and sequence effects and the length of the test session. Because within-subjects statistical comparisons are not influenced by inequalities between groups, they are statistically more powerful than between-subjects designs, which means testers need fewer participants to detect a difference. To eliminate effects due to order and the interaction of the products with each other, you need to counterbalance the order and sequence of the products. (See Fisher & Yates, 1963, and Dumas, 1998a, for rules for counterbalancing.) Testers also have to be concerned about the test session becoming so long that participants get tired.
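One common way to counterbalance is a balanced Latin square, sketched below for a hypothetical four-product comparison. The construction shown is the standard one for an even number of products; with an odd number, each row's reverse is usually run as well. The product names are placeholders.

```python
def balanced_latin_square(products):
    """Return one presentation order per row; each product appears once in each position,
    and each product immediately follows every other product equally often (even n)."""
    n = len(products)
    first = [0]
    lo, hi = 1, n - 1
    while lo <= hi:            # interleave low and high indices: 0, 1, n-1, 2, n-2, ...
        first.append(lo)
        if lo != hi:
            first.append(hi)
        lo, hi = lo + 1, hi - 1
    return [[products[(idx + row) % n] for idx in first] for row in range(n)]

products = ["Product A", "Product B", "Product C", "Product D"]
for i, order in enumerate(balanced_latin_square(products), start=1):
    print(f"Participant {i}: {' -> '.join(order)}")
```

With more participants than rows, the orders are simply reused in rotation so that each order is run equally often.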

There are some designs that are hybrids because they use within-subjects comparisons but don't include all of the combinations. For example, if testers are comparing their product to two of their competitors, they might not care about how the two competitors compare with each other. In that case, each participant would use the testers' product and one of the others, but no one would use both of the competitors' products. This design allows the statistical power of a within-subjects design for some comparisons (those involving the testers' product). In addition, the test sessions are shorter than with the complete within-subjects design.

Eliminating Bias in Comparisons. For a comparison test to be valid, it must be fair to all of the products. There are at least three potential sources of bias: the selection of participants, the selection and wording of tasks, and the interactions between the test administrator and the participants during the sessions.

The selection of participants can be biased in both a between- and a within-subjects design. In a between-subjects design, the bias can come directly from selecting participants who have more knowledge or experience with one product. The bias can be indirect if the participants selected to use one product are more skilled at some auxiliary tasks, such as using the operating system, or are more computer literate. In a competitive test using a between-subjects design, it is almost always necessary to provide evidence showing that the groups are equivalent, such as by having them attain similar average scores in a qualification test or by assigning them to the products by some random process. In a within-subjects design, the bias can come from participants having more knowledge or skill with one product. Again, a qualification test could provide evidence that they know each product equally well.

Establishing the fairness of the tasks is usually one of the most difficult activities in a comparison test, even more so in a competitive test. One product can be made to look better than any other product by carefully selecting tasks. Every user interface has strengths and weaknesses. The tasks need to be selected because they are typical for the sample of users and the tasks they normally do. Unlike a diagnostic test, the tasks in a competitive test should not be selected because they are likely to uncover a usability problem or because they probe some aspect of one of the products.

Even more difficult than establishing a lack of bias in task selection is avoiding the appearance of bias. If people who work for the company that makes one of the products select the tasks, it is difficult to counter the charge of bias even if there is no bias. This problem is why most organizations will hire an outside company or consultant to select the tasks and run the test. But often the consultant doesn't know enough about the product area to be able to select tasks that are typical for end users. One solution is to hire an industry expert to select or approve the selection of tasks. Another is to conduct a survey of end users, asking them to list the tasks they do.

The wording of the task scenarios can also be a source of bias, for example, because they describe tasks in the terminology used by one of the products. The scenarios need to be scrubbed of biasing terminology.

Finally, the test administrator who interacts with each test participant must do so without biasing the participants. The interaction in a competitive test must be as minimal as possible. The test administrator should not provide any guidance in performing tasks and should be careful not to give participants rewarding feedback after task success. If participants are to be told when they complete a task, it should be done after every completed task for all products. Because of the variability in task times it causes, participants should not be thinking aloud and should be discouraged from making verbal tangents during the tasks.

Baseline Usability Tests

One of the ways to measure progress in user interface design is by comparing the results of a test to a usability baseline. Without a baseline, it can be difficult to interpret quantitative measures from a test and put them in context. For example, if it takes a sample of participants 7 minutes to complete a task with an average of two errors, how does a tester interpret that result? One way is to compare it to a usability goal for the task. Another is to compare it to the results of the same task in an earlier version of the product.

But establishing a baseline of data takes care. Average measures from a diagnostic usability test with a few participants can be highly variable for two reasons. First, because of the small number of participants, average scores can be distorted by a wildcard. Because of this variability, it is best to use a sample size closer to those from a comparison test than those from a diagnostic test. Second, the thinking-aloud procedure typically used in diagnostic tests adds to the variability in performing the task. It is best not to have participants think aloud in a baseline test, which makes the data cleaner but also lessens its value as a diagnostic tool.
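As a small illustration of why the wildcard matters when comparing against a baseline, the sketch below contrasts mean and median task times for an invented baseline release and an invented current release that includes one extreme participant. All of the figures are made up.

```python
from statistics import mean, median

baseline_times = [420, 385, 510, 460, 395, 470, 430, 455]   # earlier release, seconds
current_times  = [300, 340, 280, 900, 310, 295, 330, 320]   # new release, with one 900 s wildcard

# The mean is pulled upward by the single extreme time; the median is far less affected.
print(f"Baseline: mean {mean(baseline_times):.0f} s, median {median(baseline_times):.0f} s")
print(f"Current:  mean {mean(current_times):.0f} s, median {median(current_times):.0f} s")
```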

Allowing Free Exploration

An important issue in user testing is what the participant does first. For example, if all users will have some training before they use the product, the tester might want to provide this training. There is often a preamble to the first task scenario that puts the test and the tasks into some context. Most often, the preamble leads to the first task scenario. Using this procedure immediately throws the participant into product use. Some testers argue that this procedure is unrealistic, that in the "real world" people don't work that way but spend a few minutes exploring the product before they start doing tasks. Others argue that going directly to tasks without training or much of a preamble puts stress on the product to stand on its own, stress that is beneficial in making the product more usable.

Should testers consider allowing the test participants 5 to 10 minutes of exploration before they begin the task scenarios? Those in favor of free exploration argue that without it, the product is getting a difficult evaluation and that the testing situation is not simulating the real use environment, especially for Web-based products. Users must know something about the product to buy it, or their company might give them some orientation to it. Those against free exploration argue that it introduces added variability into the test; some participants will find information that helps them do the tasks, but others won't find the same information. Furthermore, nobody really knows what users do when no one is watching. A usability test session is a constructed event that does not attempt to simulate every component of the real use environment. Finally, the test is intended to be a difficult evaluation for the product to pass. This debate continues, but most testers do not allow free exploration.

CHALLENGES TO THE VALIDITY OF USABILITY TESTING

For most of its short history, user testing has been remarkably free from criticism. Part of the reason for this freedom is the high face validity of user testing, which means that it appears to measure usability. User testing easily wins converts. When visitors watch a test for the first time, they think they are seeing a "real" user spontaneously providing their inner experiences through their think-aloud protocol. Visitors often conclude that they are seeing what really happens when no one is there to watch customers. When a usability problem appears in the performance of a test participant, it is easy to believe that every user will have that problem.

But some impressions of user testing can be wrong. A test session is hardly a spontaneous activity. On the contrary, a user test is a very constructed event. Each task and each word in each scenario has been carefully chosen for a specific purpose. And unfortunately, we don't know what really happens when no one is watching.

In the past 5 years, researchers and practitioners have begun to ask tough questions about the validity of user testing as part of a wider examination of all usability evaluation methods. This skepticism is healthy for the usability profession. Here I discuss four challenges to validity:

1. How do we evaluate usability testing?
2. Why can't we map usability measures to user interface components?
3. Are we ignoring the operational environment?
4. Why don't usability specialists see the same usability problems?

How Do We Evaluate Usability Testing?

One of the consequences of making a distinction between usability testing and research is that it becomes unclear how to evaluate the quality and validity of a usability test, especially a diagnostic test. As I have noted, usability professionals who write about testing agree that a usability test is not a research study. Consequently, it is not clear whether the principles of research design should be applied to a diagnostic usability test. Principles, such as isolating an independent variable and having enough test participants to compute a statistical test, do not apply to diagnostic usability testing. The six essential characteristics of user testing described above set the minimum conditions for a valid usability test but do not provide any further guidance. For example, are all samples of tasks equal in terms of ensuring the validity of a test? Are some better than others? Would some samples be so bad as to invalidate the test and its results? Would any reasonable sample of tasks uncover the global usability problems? Is a test that misses uncovering a severe usability problem just imperfect, or is it invalid?

Dumas (1999) explored other ways to judge the validity of a user test. For example, Skinner (1956) invented a design in which causality between independent and dependent variables was established with only one animal. By turning the independent variable on and off several times with the same animal, he was able to establish a causal relationship between, for example, a reinforcement schedule and the frequency and variability of bar pressing or pecking. In some ways, Skinner's method is similar to having the same usability problem show up many times both between and within participants in a usability test. In this analogy, usability problems that repeat would establish a causal relationship between the presentation of the same tasks with the same product and the response of the participants. This relationship is exactly why a tester becomes confident that problems that repeat are caused by a flawed design. But should we end there? Should we only fix repeating problems? And what if, as often happens, some participants don't have the problem? It is not clear where to draw the repetition line.

Hassenzahl (1999) argued that a usability tester is like a clinician trying to diagnose a psychological illness. An effective tester is one who is good at tying symptoms, that is, usability problems, to a cause: a poor design. In this analogy, a goal for the profession is to create a diagnostic taxonomy to make problem interpretations more consistent. Gray and Salzman (1998) and Lund (1998) have made similar points. Until that happens, however, we are left looking for good clinicians (testers), but we have little guidance about what makes a valid test.

Why Can't We Map Usability Measures to User Interface Components?

An important challenge to the validity of usability testing, and in Gray and Salzman's (1998) view the most important, is the difficulty of relating usability test measures to components of the user interface. Practitioners typically use their intuition and experience to make such connections. For example, a long task time along with several errors in performing a task may be attributed to a poorly organized menu structure. Would other testers make the same connection? Do these two measures always point to the same problem? Do these measures only point to this one problem? Is the problem restricted to one menu or several? Are some parts of the menu structure effective?

As we have seen, common practice in test reporting is to group problems into more general categories. For example, difficulties with several words in an interface might be grouped under a "terminology" or a "jargon" category. Unfortunately, there is no standardized set of these categories. Each test team can roll its own categories. This makes the connection from design component to measures even more difficult to make. Landauer (1995) urged usability professionals and researchers to link measures such as the variability in task times to specific cognitive strategies people use to perform tasks. Virzi et al. (1993) compared the results of a performance analysis of objective measures with the results of a typical think-aloud protocol analysis. They identified many fewer problems using performance analysis.

That study and others suggest that many problems identified in a usability test come from the think-aloud protocol alone. Could some of these problems be false alarms, which is to say they are not usability problems at all? Expert review, an inspection evaluation method, has been criticized for proliferating false alarms. Bailey, Allan, and Raiello (1992) claimed that most of the problems identified by experts are false alarms. But they used the problems that they identified from user testing as the comparison. If Bailey et al. are correct, most of the problems identified by user testing also are false alarms. Their study suggests that the only practice that makes any difference is to fix the one or two most serious problems found by user testing.

Without a consistent connection between measures and user interface components, the identification of problems in a user test looks suspiciously like an ad hoc fishing expedition.

Are We Ignoring the Operational Environment?

Meister (1999) took the human factors profession to task for largely ignoring the environment within which products and systems are used (see Fig. 56.3). He asserted that in human factors, the influence of the environment on the human-technology interaction is critical to the validity of any evaluation. He proposed that human factors researchers have chosen erroneously to study the interaction of people and technology largely in a laboratory environment. He noted that "any environment in which phenomena are recreated, other than the one for which it was intended, is artificial and unnatural" (p. 66). Although Meister did not address usability testing directly, he presumably would have the same criticism of the use of testing laboratories to evaluate product usability.

[Figure 56.3 is a diagram with elements labeled "Human-Technology" and "Tasks."]

FIGURE 56.3. The scope of human factors. From "Usability testing methods: When does a usability test become a research experiment?" by J. Dumas, 2000, Common Ground, 10. Reprinted with permission.

Those who believe that it is important to work with users in their operational environment as the usability specialists gather requirements also believe that at least early prototype testing should be conducted in the work environment (Beyer & Holtzblatt, 1997). The assumption these advocates make is that testing results will be different if the test is done in the work environment rather than in a usability lab. These differences will lead to designs that are less effective if the richness of the work environment is ignored. The proponents of testing in the work environment offer examples to support their belief, but to date there are no research studies that speak to this issue.

One could imagine a continuum on which to place the influence of the operational environment on product use. For some products, such as office productivity tools, it seems unlikely that the operational environment would influence the usability of a product. For some other products, such as factory floor operational software, the physical and social environments definitely influence product use; and then there is a wide range of products that fall in between. For example, would an evaluation of a design for a clock radio be complete if test participants didn't have to read the time from across a dark room? Or shut the alarm off with one hand while lying down in a dark room?

Meister admitted that it is often difficult to create or simulate the operational environment. The classic case is an accident in a power plant that happens once in 10 years. One of the reasons the operational environment is not considered more often in usability evaluation is that it is inconvenient and sometimes difficult to simulate. When we list convenience as a quality of a usability lab, we need to keep in mind that for some products, the lab environment may be insufficient for uncovering all of the usability problems in products.

Why Don't Usability Specialists See the Same Usability Problems?

Earlier I discussed the fact that usability specialists who viewed sessions from the same test had little agreement about which problems they saw and which ones were the most serious (Jacobsen & John, 1998). There are two additional studies that also speak to this point (Molich et al., 1998, 2001). These studies both had the same structure. A number of usability labs were asked to test the same product. They were given broad instructions about the user population and told that they were to do a "normal" usability test. In the first study, four labs were included; in the second, there were seven. The results of these studies were not encouraging. There were many differences in how the labs went about their testing. It is clear from these studies that there is little commonality in testing methods. But even with that caveat, one would expect these labs staffed by usability professionals to find the same usability problems. In the first study, there were 141 problems identified by the four labs. Only one problem was identified by all of the labs. Ninety-one percent of the problems were identified by only one lab. In the second study, there were 310 problems identified by the seven teams. Again, only one problem was identified by all seven teams, and 75% of the problems were identified by only one team.
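The kind of overlap statistic reported in these studies is straightforward to compute once each team's problem list has been reconciled into a common set of identifiers, which is itself the hard part. The sketch below uses invented problem identifiers for four hypothetical labs.

```python
# Hypothetical problem identifiers reported by four labs after reconciling descriptions.
labs = {
    "Lab 1": {"p1", "p2", "p3", "p7"},
    "Lab 2": {"p1", "p4", "p5"},
    "Lab 3": {"p1", "p2", "p6", "p8"},
    "Lab 4": {"p1", "p9"},
}

all_problems = set().union(*labs.values())
counts = {p: sum(p in found for found in labs.values()) for p in all_problems}

found_by_all = sum(1 for c in counts.values() if c == len(labs))
found_by_one = sum(1 for c in counts.values() if c == 1)

print(f"Distinct problems:     {len(all_problems)}")
print(f"Found by every lab:    {found_by_all} ({found_by_all / len(all_problems):.0%})")
print(f"Found by only one lab: {found_by_one} ({found_by_one / len(all_problems):.0%})")
```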

Our assumption that usability testing is a good method for finding the important problems quickly has to be questioned by the results of these studies. It is not clear why there is so little overlap in problems. Are slight variations in method the cause? Are the problems really the same but just described differently? We look to further research to sort out the possibilities.

ADDITIONAL ISSUES

In this final section on usability testing, I discuss five issues:

1. How do we evaluate ease of use?
2. How does user testing compare with other evaluation methods?
3. Is it time to standardize methods?
4. Are there ethical issues in user-based evaluation?
5. Is testing Web-based products different?

How Do We Evaluate Ease of Use?

Usability testing is especially good at assessing initial ease of learning issues. In many cases, a usability test probes the first hour or two of use of a product. Testers see this characteristic as an asset because getting started with a new product is often a key issue. If users can't get past initial usability barriers, they may never use the product again, or they may use only a small part of it.

Longer term usability issues are more difficult to evaluate. Product developers often would like to know what usability will be like after users learn how to use a product. Will users become frustrated by the very affordances that help them learn the product in the first place? How productive will power users be after 6 months of use?

Although there is no magic potion that will tell developers what usability will be like for a new product after 6 months, there are some techniques that address some long-term concerns:

• Repeating the same tasks one or more times during the session: this method gets at whether usability problems persist when users see them again.

• Repeating the test: a few weeks in between tests provides some estimate of long-term use.

• Providing training to participants who will have it when the product is released: establishing a proficiency criterion that participants have to reach before they are tested is a way to control for variations in experience.

Although these techniques sometimes are useful, assessing the ease of use for a new product is difficult with any evaluation method.

How Does Usability Testing Compare With Other Evaluation Methods?

In the early 1990s, there were several research studies that looked at the ability of user testing to uncover usability problems and compared testing with other evaluation methods, especially expert reviews and cognitive walkthroughs (Desurvire, 1994; Jeffries, Miller, Wharton, & Uyeda, 1991; Karat, Campbell, & Fiegel, 1992; Nielsen & Phillips, 1993). The evaluation methods together are now called UEMs (usability evaluation methods). In these studies, testing generally came out quite well in comparison with the other methods. Its strengths were in finding severe usability problems quickly and finding unique problems, that is, problems not uncovered by other UEMs.

Jeffries et al. (1991) found that usability testing didn't uncover as many problems as an expert review and that no one expert found more than 40% of the problems. Furthermore, when the authors segmented the problems by severity, usability testing found the smallest number of the least severe problems and the expert reviewers found the most. Karat et al. (1992) compared usability testing to two kinds of walkthroughs and found that testing found more problems and more severe problems. In addition, usability testing uncovered more unique problems than walkthroughs. Desurvire (1994) compared usability testing to both expert reviews and walkthroughs and found that usability testing uncovered the most problems, the most severe problems, and the most unique problems.

Dumas and Redish (1993), reviewing these studies from a usability testing perspective, summarized the strengths of usability testing as uncovering more severe problems than the other methods. Since that time, this clear-cut depiction of usability testing has been challenged. All of these studies and more were reviewed by Gray and Salzman (1998) and by Andre, Williges, and Hartson (1999) in a meta-analysis of the comparison research. In Gray and Salzman's view, all of the studies are flawed, being deficient in one or more of five types of validity. Their analysis makes it difficult to be sure what conclusions to draw from the comparison studies. Andre et al. proposed three criteria to evaluate UEMs: thoroughness (finding the most problems), validity (finding the true problems), and reliability (repeatedly finding the same problems). They found that they could only compare UEM studies on thoroughness, with inspection methods being higher on it than usability testing. Andre et al. could not find sufficient data to compare UEMs on validity or reliability. Fu, Salvendy, and Turley (1998) proposed that usability testing and expert reviews find different kinds of problems.
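Andre et al. describe thoroughness, validity, and reliability verbally; a common way to operationalize the first two is as simple set ratios against a list of problems assumed to be real. The sketch below is only an illustration of that idea; the problem identifiers and the choice of the "real" problem set are invented, and the chapter itself does not give these formulas.

    # Illustrative only: thoroughness and validity as set ratios.
    # The "real" problem set would have to come from some trusted source;
    # here both sets are invented for the example.
    real_problems = {"P1", "P2", "P3", "P4", "P5"}   # problems assumed to truly exist
    reported = {"P1", "P2", "P7"}                    # problems one UEM reported ("P7" is a false alarm)

    hits = reported & real_problems
    thoroughness = len(hits) / len(real_problems)    # share of real problems the method found
    validity = len(hits) / len(reported)             # share of the method's reports that are real

    print(f"thoroughness = {thoroughness:.2f}, validity = {validity:.2f}")
    # Reliability would require comparing repeated applications of the same
    # method, which this sketch does not attempt.

The difficulty Andre et al. ran into is visible even in this small example: without an agreed way to fix the set of real problems, neither ratio can be computed with confidence.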

As we have described above, it now appears that the inability of experts or researchers to consistently agree on whether problems are severe makes it difficult to tout usability testing's purported strength at uncovering severe problems quickly. Even the conclusion that usability testing finds unique problems is suspect, because those problems might be false alarms. Andre et al. proposed that usability testing be held up as the yardstick against which to compare other UEMs. But the assumption that usability testing uncovers the true problems has not been established.

Gray and Salzman's analysis was criticized by usability practitioners (Olson & Moran, 1998). The practitioners were not ready to abandon their confidence in the conclusions of the comparison studies and continue to apply them to evaluate the products they develop. To date, no one has shown that any of Gray and Salzman's or Andre et al.'s criticisms of the lack of validity of the UEM studies is incorrect. At present, the available research leaves us in doubt about the advantages and disadvantages of usability testing relative to other UEMs.

Is It Time to Standardize Methods?

Several standards-setting organizations have included user-based evaluation as one of the methods they recommend or require for assessing the usability of products. These efforts usually take a long time to gestate, and their recommendations are sometimes not up to date, but the trends are often indicative of a method's acceptance in professional circles.

The International Organization for Standardization (ISO) Standard ISO 9241, "Ergonomic requirements for office work with visual display terminals (VDTs)," describes the ergonomic requirements for the use of visual display terminals for office tasks. Part 11 provides the definition of usability, explains how to identify the information that is necessary to take into account when evaluating usability, and describes required measures of usability. Part 11 also includes an explanation of how the usability of a product can be evaluated as part of a quality system. It explains how measures of user performance and satisfaction, when gathered in methods such as usability testing, can be used to measure product usability.
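As a rough sketch of the kinds of performance and satisfaction measures Part 11 has in mind, the example below summarizes effectiveness, efficiency, and satisfaction for one task. The task data, the 7-point rating item, and the particular formulas are invented for illustration; ISO 9241-11 names the measure categories but does not prescribe this computation.

    # Hypothetical results for one task attempted by five participants.
    completed = [True, True, False, True, True]      # task success per participant
    task_times_sec = [210, 185, 300, 240, 200]       # time on task in seconds
    satisfaction_1_to_7 = [5, 6, 3, 5, 6]            # post-task rating on an invented 7-point item

    effectiveness = sum(completed) / len(completed)  # completion rate
    successful_times = [t for t, ok in zip(task_times_sec, completed) if ok]
    efficiency = sum(successful_times) / len(successful_times)   # mean time per successful attempt
    satisfaction = sum(satisfaction_1_to_7) / len(satisfaction_1_to_7)

    print(f"effectiveness = {effectiveness:.0%}")
    print(f"efficiency    = {efficiency:.0f} s per successful task")
    print(f"satisfaction  = {satisfaction:.1f} on a 7-point scale")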

ISO/DIS 13407, "Human-centered design processes for interactive systems," provides guidance on human-centered design, including user-based evaluation, throughout the life cycle of interactive systems. It also provides guidance on sources of information and standards relevant to the human-centered approach. It describes human-centered design as a multidisciplinary activity, which incorporates human factors and ergonomics methods such as user testing. These methods can enhance the effectiveness and efficiency of working conditions and counteract possible adverse effects of use on human health, safety, and performance.

One of the most interesting efforts to promote usability methods has been conducted by the U.S. Food and Drug Administration (FDA), specifically the Office of Health and Industrial Programs, which approves new medical devices. In a report titled "Do It by Design" (http://www.fda.gov/cdrh/humfac/doit.html), the FDA described what it considers best practices in human factors methods that can be used to design and evaluate devices. Usability testing plays a prominent part in that description. The FDA stops short of requiring specific methods but does require that device manufacturers prove that they have an established human factors program. The FDA effort is an example of the U.S. Government's relatively recent but enthusiastic interest in usability (http://www.usability.gov).

The most relevant standards-setting effort to those who conduct user-based evaluations is the National Institute of Standards and Technology's (NIST) Industry Usability Reporting project (IUSR). This project has been underway since 1997. It consists of more than 50 representatives of industry, government, and consulting who are interested in developing standardized methods and reporting formats for quantifying usability (http://zing.ncsl.nist.gov/iusr/).

One of the purposes of the NIST IUSR project is to provide mechanisms for dialogue between large customers, who would like to have usability test data factored into the procurement decision for buying software, and vendors, who may have usability data available.


NIST worked with a committee of usability experts from around the world to develop a format for usability test reports, called the common industry format (CIF). The goal of the CIF is to facilitate communication about product usability between large companies who want to buy software and providers who want to sell it. The CIF provides a way to evaluate the usability of the products buyers are considering on a common basis. It specifies what should go into a report that conforms to the CIF, including what is to be included about the test method, the analysis of data, and the conclusions that can be drawn from the analysis. The CIF is intended to be written by usability specialists and read by usability specialists. One of its assumptions is that, given the appropriate data specified by the CIF, a usability specialist can measure the usability of a product their company is considering buying.
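As a loose illustration of the kind of content such a report covers, the skeleton below groups the method, the data analysis, and the conclusions into one structure. The field names and sample values are my own shorthand, not the normative section titles of the CIF or of ANSI/NCITS 354-2001.

    # Illustrative report skeleton only; field names are shorthand,
    # not the normative CIF structure.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SummativeTestReport:
        product: str
        test_objectives: str
        participants: str        # who they were and how they were recruited
        tasks: List[str]
        test_environment: str    # equipment and setting
        metrics: List[str]       # e.g., completion rate, time on task, satisfaction
        results: str             # the analyzed performance and satisfaction data
        conclusions: str

    report = SummativeTestReport(
        product="(product under evaluation)",
        test_objectives="Summative measurement to support a procurement decision",
        participants="8 experienced order-entry clerks",
        tasks=["Enter a new order", "Correct a shipping address"],
        test_environment="Usability lab, standard desktop configuration",
        metrics=["completion rate", "mean time on task", "mean satisfaction rating"],
        results="(summary statistics would go here)",
        conclusions="(what a usability specialist can infer from the data)",
    )
    print(f"{report.product}: {len(report.tasks)} tasks, {len(report.metrics)} metrics reported")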

The CIF is not intended to apply to all usability tests. It applies to a summative test done late in the development process to measure the usability of a software product, not to diagnostic usability tests conducted earlier in development.

The American National Standards Institute (ANSI) has created the CIF as one of its standards (ANSI/NCITS 354-2001). The CIF document is available from http://techstreet.com. It is difficult to know how this standard will be used, but it could mean that in the near future vendors who are selling products to large companies could be required to submit a test report in CIF format.

Are There Ethical Issues in User Testing?

Every organization that does user testing needs a set of policies and procedures for the treatment of test participants. Most organizations with policies use the federal government or American Psychological Association policies for the treatment of participants in research. At the heart of the policies are the concepts of informed consent and minimal risk. Minimal risk means that "the probability and magnitude of harm or discomfort anticipated in the test are not greater, in and of themselves, than those ordinarily encountered in daily life or during the performance of routine physical or psychological examination or tests."¹ Most usability tests do not put participants at more than minimal risk. If the test director feels that there may be more than minimal risk, he or she should follow the procedures described in the Notice of Proposed Rule Making in the Federal Register, 1988, Vol. 53, No. 218, pp. 45661-45682.

Even if the test does not expose participants to more than minimal risk, testers should have participants read and sign an informed consent form, which should describe the purpose of the test; what will happen during the test, including the recording of the session, what will be done with the recording, and who will be watching the session; and the participants' right to ask questions and withdraw from the test at any time. Participants need to have the chance to give their consent voluntarily. For most tests, that means giving them time to read the form and asking them to sign it as an indication of their acceptance of what is in the form. For an excellent discussion of how to create and use consent forms, see Waters, Carswell, Stephens, and Selwitz (2001).

The special situation in which voluntariness may be in question can happen when testers sample participants from their own organizations. The participants have a right to know who will be watching the session and what will be done with the videotape. If the participants' bosses or other senior members of the organization will be watching the sessions, it is difficult to determine when the consent is voluntary. Withdrawing from the session may be negatively perceived. The test director needs to be especially careful in this case to protect participants' rights to give voluntary consent. The same issue arises when the results of a test with internal participants are shown in a highlight tape. In that case, the participants need to know before the test that the tape of the session might be viewed by people beyond the development team. Test directors should resist making a highlight tape of any test done with internal participants. If that can't be avoided, the person who makes the highlight tape needs to be careful about showing segments of tape that place the participant in a negative light, even if only in the eyes of the participant.

The names of test participants also need to be kept in confidence for all tests. Only the test director should be able to match data with the name of a participant. The participants' names should not be written on data forms or on videotapes. Use numbers or some other code to match the participant with their data. It is the test director's responsibility to refuse to match names with data, especially when the participants are internal employees.
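A minimal sketch of the kind of coding scheme this implies appears below. The participant names, file name, and code format are invented; the point is only that data forms and tape labels carry a code, while the key that links codes to names stays with the test director.

    import csv

    # Invented names and file name; only the test director keeps this key file.
    participants = ["Participant name 1", "Participant name 2", "Participant name 3"]
    key = {f"P{i:02d}": name for i, name in enumerate(participants, start=1)}

    # Data forms and videotape labels carry only the codes (P01, P02, ...).
    with open("participant_key_DIRECTOR_ONLY.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["code", "name"])
        writer.writerows(key.items())

    print("Codes for data forms and tape labels:", ", ".join(key))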

This discussion should make it clear that it may be difficult to interpret subjective measures of usability when the participants are internal employees. Their incentive to give positive ratings to the product may be increased when they believe that people from their company may be able to match their rating with their name.

Is Testing Web-Based Products Different?

There is nothing fundamentally different about testing Web products, but the logistics of such tests can be a challenge (Grouse, Jean-Pierre, Miller, & Goff, 1999). Often the users of Web-based products are geographically dispersed and may be more heterogeneous in their characteristics than users of other technologies. The most important challenge in testing these products is the speed with which they are developed (Wichansky, 2000). Unlike products with traditional cyclic development processes, Web products often do not have released versions. They are changed on a weekly, if not daily, basis. For testing, this means gaining some control over the product being tested. It needs to be stable while it is tested, not a moving target.

With control comes the pressure to produce results quickly. Conducting a test in 8 to 12 weeks is no longer possible in fast-paced development environments. Testing in 1 or 2 weeks is more often the norm now. Testing with such speed is only possible in environments where the validity of testing is not questioned and the test team is experienced.

¹ Notice of Proposed Rule Making in the Federal Register, 1988, Vol. 53, No. 218, p. 45663.


The Future of Usability Testing

Usability testing is clearly the most complex usability evaluation method, and we are only beginning to understand the implications of that complexity. It appears that usability testing has entered into a new phase in which its strengths and weaknesses are being seriously debated, although it remains very popular and new usability labs continue to open. Before 1995 the validity of testing was seldom challenged. The recent research has opened up a healthy debate about our assumptions about this method. We can never go back to our earlier innocence about this method, which looks so simple in execution but whose subtleties we are only beginning to understand.

WHICH USER-BASED METHOD TO USE?

Deciding which of the user-based evaluation methods to use should be done in the context of the strengths and weaknesses of all of the usability inspection methods discussed in Chapter 57. Among the user-based methods, direct or video observation is useful in special situations. It allows usability specialists to observe populations of users who cannot otherwise be seen or who can only be observed through the medium of videotape. Questionnaires are a useful way to evaluate a broad sample of users, to measure the usability of a product that has been used by the same people over a long period of time, and to sample repeatedly the same user population. The best questionnaires also have the potential to allow usability comparisons across products and, perhaps, to provide an absolute measure of usability. Usability testing can be used throughout the product development cycle to diagnose usability problems. Its findings have the most credibility with developers of all of the evaluation methods. As currently practiced, tests can be conducted quickly and allow retesting to check whether solutions to usability problems are effective. Using testing to compare products or to provide an absolute measure of usability requires more time and resources and testers who have knowledge of research design and statistics.

References

Abelow, D. (1992). Could usability testing become a built-in product feature? Common Ground, 2, 1-2.

Andre, T., Williges, R., & Hartson, H. (1999). The effectiveness of usability evaluation methods: Determining the appropriate criteria. Proceedings of the Human Factors and Ergonomics Society, 43rd Annual Meeting (pp. 1090-1094). Santa Monica, CA: Human Factors and Ergonomics Society.

Baber, C., & Stanton, N. (1996). Observation as a technique for usability evaluation. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 85-94). London: Taylor & Francis.

Bailey, R. W. (1993). Performance vs. preference. Proceedings of the Human Factors and Ergonomics Society, 37th Annual Meeting (pp. 282-286). Santa Monica, CA: Human Factors and Ergonomics Society.

Bailey, R. W., Allan, R. W., & Raiello, P. (1992). Usability testing vs. heuristic evaluation: A head-to-head comparison. Proceedings of the Human Factors and Ergonomics Society, 36th Annual Meeting (pp. 409-413). Santa Monica, CA: Human Factors and Ergonomics Society.

Barker, R. T., & Biers, D. W. (1994). Software usability testing: Do user self-consciousness and the laboratory environment make any difference? Proceedings of the Human Factors Society, 38th Annual Meeting (pp. 1131-1134). Santa Monica, CA: Human Factors and Ergonomics Society.

Bauersfeld, K., & Halgren, S. (1996). "You've got three days!" Case studies in field techniques for the time-challenged. In D. Wixon & J. Ramey (Eds.), Field methods casebook for software design (pp. 177-196). New York: John Wiley.

Beyer, H., & Holtzblatt, K. (1997). Contextual design: Designing customer-centered systems. San Francisco: Morgan Kaufmann.

Bias, R. (1994). The pluralistic usability walkthrough: Coordinated empathies. In J. Nielsen & R. Mack (Eds.), Usability inspection methods (pp. 63-76). New York: John Wiley.

Boren, M., & Ramey, J. (2000, September). Thinking aloud: Reconciling theory and practice. IEEE Transactions on Professional Communication, 1-23.

Bowers, V., & Snyder, H. (1990). Concurrent versus retrospective verbal protocols for comparing window usability. Proceedings of the Human Factors Society, 34th Annual Meeting (pp. 1270-1274). Santa Monica, CA: Human Factors and Ergonomics Society.

Bradburn, N. (1983). Response effects. In R. Rossi, M. Wright, & J. Anderson (Eds.), The handbook of survey research (pp. 289-328). New York: Academic Press.

Branaghan, R. (1997). Ten tips for selecting usability test participants. Common Ground, 7, 3-6.

Branaghan, R. (1998). Tasks for testing documentation usability. Common Ground, 8, 10-11.

Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 189-194). London: Taylor & Francis.

Cantani, M. B., & Biers, D. W. (1998). Usability evaluation and prototype fidelity: Users and usability professionals. Proceedings of the Human Factors Society, 42nd Annual Meeting (pp. 1331-1335). Santa Monica, CA: Human Factors and Ergonomics Society.

Chignell, M. (1990). A taxonomy of user interface terminology. SIGCHI Bulletin, 21, 27-34.

Chin, J. P., Diehl, V. A., & Norman, K. L. (1988). Development of an instrument measuring user satisfaction of the human-computer interface. Proceedings of Human Factors in Computing Systems '88, 213-218.

Desurvire, H. W. (1994). Faster, cheaper! Are usability inspection methods as effective as empirical testing? In J. Nielsen & R. Mack (Eds.), Usability inspection methods (pp. 173-202). New York: John Wiley.

DeVries, C., Hartevelt, M., & Oosterholt, R. (1996). Private camera conversation: A new method for eliciting user responses. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 147-156). London: Taylor & Francis.

Dobroth, K. (1999, May). Practical guidance for conducting usability tests of speech applications. Paper presented at the annual meeting of the American Voice I/O Society (AVIOS), San Diego, CA.

Doll, W., & Torkzadeh, G. (1988). The measurement of end-user computing satisfaction. MIS Quarterly, 12, 259-374.


Dumas, J. (1998a). Usability testing methods: Using test participants as their own controls. Common Ground, 8, 3-5.

Dumas, J. (1998b). Usability testing methods: Subjective measures, Part I: Creating effective questions and answers. Common Ground, 8, 5-10.

Dumas, J. (1998c). Usability testing methods: Subjective measures, Part II: Measuring attitudes and opinions. Common Ground, 8, 4-8.

Dumas, J. (1999). Usability testing methods: When does a usability test become a research experiment? Common Ground, 9, 1-5.

Dumas, J. (2000). Usability testing methods: The fidelity of the testing environment. Common Ground, 10, 3-5.

Dumas, J. (2001). Usability testing methods: Think-aloud protocols. In R. Branaghan (Ed.), Design by people for people: Essays on usability. Chicago: Usability Professionals' Association.

Dumas, J., & Redish, G. (1993). A practical guide to usability testing. NJ: Ablex.

Dumas, J., & Redish, G. (1999). A practical guide to usability testing (Rev. ed.). London: Intellect Books.

Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Fisher, R. A., & Yates, F. (1963). Statistical tables for biological, agricultural and medical research. Edinburgh, Scotland: Oliver & Boyd.

Frokjaer, E., Hertzum, M., & Hornbaek, K. (2000). Measuring usability: Are effectiveness, efficiency, and satisfaction really correlated? Proceedings of Human Factors in Computing Systems 2000, 45-52.

Fu, L., Salvendy, G., & Turley, L. (1998). Who finds what in usability evaluation. Proceedings of the Human Factors and Ergonomics Society, 42nd Annual Meeting (pp. 1341-1345). Santa Monica, CA: Human Factors and Ergonomics Society.

Gaba, D. M. (1994). Human performance in dynamic medical domains. In M. S. Bogner (Ed.), Human error in medicine (pp. 197-224). Hillsdale, NJ: Lawrence Erlbaum Associates.

Gage, N., & Berliner, D. (1991). Educational psychology (5th ed.). New York: Houghton Mifflin.

Goldberg, J. H. (2000). Eye movement-based interface evaluation: What can and cannot be assessed? Proceedings of the IEA 2000/HFES 2000 Congress (44th Annual Meeting of the Human Factors and Ergonomics Society) (pp. 625-628). Santa Monica, CA: Human Factors and Ergonomics Society.

Gray, W., & Salzman, M. (1998). Damaged merchandise? A review of experiments that compare usability methods [Special issue]. Human-Computer Interaction, 13, 203-261.

Grouse, E., Jean-Pierre, S., Miller, D., & Goff, R. (1999). Applying usability methods to a large intranet site. Proceedings of the Human Factors and Ergonomics Society, 43rd Annual Meeting (pp. 782-786). Santa Monica, CA: Human Factors and Ergonomics Society.

Ground, C., & Ensing, A. (1999). Apple pie a-la-mode: Combining subjective and performance data in human-computer interaction tasks. Proceedings of the Human Factors and Ergonomics Society, 43rd Annual Meeting (pp. 1085-1089). Santa Monica, CA: Human Factors and Ergonomics Society.

Hackman, G. S., & Biers, D. W. (1992). Team usability testing: Are two heads better than one? Proceedings of the Human Factors Society, 36th Annual Meeting (pp. 1205-1209). Santa Monica, CA: Human Factors and Ergonomics Society.

Hartson, H. R., Castillo, J. C., Kelso, J., Kamler, J., & Neale, W. C. (1996). Remote evaluation: The network as an extension of the usability laboratory. Proceedings of Human Factors in Computing Systems '96, 228-235.

Hassenzahl, M. (1999). Usability engineers as clinicians. Common Ground, 9, 12-13.

Hughes, M. (1999). Rigor in usability testing. Technical Communication, 46, 488-494.

Igbaria, M., & Parasuraman, S. (1991). Attitudes towards microcomputers: Development and construct validation of a measure. International Journal of Man-Machine Studies, 34, 553-573.

Jacobsen, N., & John, B. (1998). The evaluator effect in usability studies: Problem detection and severity judgments. Proceedings of the Human Factors and Ergonomics Society, 42nd Annual Meeting (pp. 1336-1340). Santa Monica, CA: Human Factors and Ergonomics Society.

Jeffries, R., Miller, J., Wharton, C., & Uyeda, K. (1991). User interface evaluation in the real world: A comparison of four techniques. Proceedings of Human Factors in Computing Systems '91, 119-124.

Kantner, L. (2001a). Following a fast-moving target: Recording user behavior in Web usability testing. In R. Branaghan (Ed.), Design by people for people: Essays on usability (pp. 235-244). Chicago: Usability Professionals' Association.

Kantner, L. (2001b). Assessing Web site usability from server log files. In R. Branaghan (Ed.), Design by people for people: Essays on usability (pp. 245-261). Chicago: Usability Professionals' Association.

Karat, C. M., Campbell, R., & Fiegel, T. (1992). Comparison of empirical testing and walk-through methods in user-interface evaluation. Proceedings of Human Factors in Computing Systems '92, 397-404.

Kennedy, S. (1989). Using video in the BNR usability lab. SIGCHI Bulletin, 21, 92-95.

Kirakowski, J. (1996). The software usability measurement inventory (SUMI): Background and usage. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 169-177). London: Taylor & Francis.

Kirakowski, J., & Corbett, M. (1988). Measuring user satisfaction. In D. Jones & R. Winder (Eds.), People and computers (Vol. IV, pp. 189-217). Cambridge, England: Cambridge University Press.

Landauer, T. K. (1995). The trouble with computers. Cambridge, MA: MIT Press.

Landay, J. A., & Myers, B. (1995). Interactive sketching for the early stages of user interface design. Proceedings of Human Factors in Computing Systems '95, 43-50.

Law, C. M., & Vanderheiden, G. C. (2000). Reducing sample sizes when user testing with people who have, and who are simulating, disabilities: Experiences with blindness and public information kiosks. Proceedings of the IEA 2000/HFES 2000 Congress, 4, 157-160. Santa Monica, CA: Human Factors and Ergonomics Society.

Ledgard, H. (1982). Evaluating text editors. Proceedings of Human Factors in Computer Systems, 135-156.

Lesaigle, E. M., & Biers, D. W. (2000). Effect of type of information on real-time usability evaluation: Implications for remote usability testing. Proceedings of the IEA 2000/HFES 2000 Congress, 6, 585-588. Santa Monica, CA: Human Factors and Ergonomics Society.

Lewis, J. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23, 78-81.

Lewis, J. (1994). Sample size for usability studies: Additional considerations. Human Factors, 36, 368-378.

Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction, 7, 57-78.

Lister, M. (2001). Usability testing software for the Internet. Proceedings of Human Factors in Computing Systems 2001, 3, 17-18.

Lund, A. M. (1998). The need for a standardized set of usability metrics. Proceedings of the Human Factors and Ergonomics Society, 42nd Annual Meeting (pp. 688-691). Santa Monica, CA: Human Factors and Ergonomics Society.


Meister, D. (1999). The history of human factors and ergonomics. Mahwah, NJ: Lawrence Erlbaum Associates.

Mitropoulos-Rundus, D., & Muzak, J. (1997). How to design and conduct a consumer in-home usability test. Common Ground, 7, 10-12.

Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., & Kirakowski, J. (1998). Comparative evaluation of usability tests. Proceedings of the Usability Professionals' Association (pp. 1-12). Dallas, TX: Usability Professionals' Association.

Molich, R., Kindlund, E., Seeley, J., Norman, K., Kaasgaard, K., Karyukina, B., Schmidt, L., Ede, M., van Oel, W., & Kahmann, R. (2002). Comparative usability evaluation. In press.

Nielsen, J. (1992). Finding usability problems through heuristic evaluation. Proceedings of Human Factors in Computing Systems '92, 373-380.

Nielsen, J., & Phillips, V. L. (1993). Estimating the relative usability of two interfaces: Heuristic, formal, and empirical methods compared. Proceedings of the Association for Computing Machinery INTERCHI '93 Conference on Human Factors in Computing Systems (pp. 214-221). New York: ACM Press.

Olson, G., & Moran, R. (1998). Damaged merchandise? A review of experiments that compare usability methods [Special issue]. Human-Computer Interaction, 13, 203-261.

Orne, M. (1969). Demand characteristics and the concept of quasi-controls. In R. Rosenthal & R. Rosnow (Eds.), Artifact in behavioral research (pp. 143-179). New York: Academic Press.

Perkins, R. (2001). Remote usability evaluation over the Internet. In R. Branaghan (Ed.), Design by people for people: Essays on usability (pp. 153-162). Chicago: Usability Professionals' Association.

Philips, B., & Dumas, J. (1990). Usability testing: Functional requirements for data logging software. Proceedings of the Human Factors Society, 34th Annual Meeting (pp. 295-299). Santa Monica, CA: Human Factors and Ergonomics Society.

Rubin, J. (1994). Handbook of usability testing. New York: John Wiley.

Scholtz, J., & Bouchette, D. (1995). Usability testing and group-based software: Lessons from the field. Common Ground, 5, 1-11.

Shneiderman, B. (1987). Designing the user interface: Strategies for effective human computer interaction. Reading, MA: Addison-Wesley.

Shneiderman, B. (1992). Designing the user interface: Strategies for effective human computer interaction (2nd ed.). Reading, MA: Addison-Wesley.

Shneiderman, B. (1997). Designing the user interface: Strategies for effective human computer interaction (3rd ed.). Reading, MA: Addison-Wesley.

Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11, 221-233.

Spenkelink, G., Beuijen, K., & Brok, J. (1993). An instrument for measurement of the visual quality of displays. Behaviour and Information Technology, 12, 249-260.

Thomas, B. (1996). Quick and dirty usability tests. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 107-114). London: Taylor & Francis.

Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects. Proceedings of the Human Factors Society, 34th Annual Meeting (pp. 291-294). Santa Monica, CA: Human Factors and Ergonomics Society.

Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.

Virzi, R. A., Sokolov, J. L., & Karis, D. (1996). Usability problem identification using both low and high fidelity prototypes. Proceedings of Human Factors in Computing Systems '96, 236-243.

Virzi, R. A., Sorce, J. E., & Herbert, L. B. (1993). A comparison of three usability evaluation methods: Heuristic, think-aloud, and performance testing. Proceedings of the Human Factors and Ergonomics Society, 37th Annual Meeting, 309-313.

Vora, P. (1994). Using teaching methods for usability evaluations. Common Ground, 4, 5-9.

Waters, S., Carswell, M., Stephens, R., & Selwitz, A. (2001). Research ethics meets usability testing. Ergonomics in Design, 9, 14-20.

Wichansky, A. (2000). Usability testing in 2000 and beyond. Ergonomics, 43, 998-1006.

Wiklund, M., Dumas, J., & Thurrott, C. (1992). Does the fidelity of software prototypes affect the perception of usability? Proceedings of the Human Factors Society, 36th Annual Meeting (pp. 1207-1212). Santa Monica, CA: Human Factors and Ergonomics Society.

Wilson, C. E., & Coyne, K. P. (2001). Tracking usability issues: To bug or not to bug? Interactions, 8, 15-19.

Wolf, C. G. (1989). The role of laboratory experiments in HCI: Help, hindrance or ho-hum? Proceedings of Human Factors in Computing Systems '89, 265-268.

Young, R., & Barnard, P. (1987). The use of scenarios in HCI research: Turbo charging the tortoise of cumulative science. Proceedings of Human Factors in Computing Systems '87, 291-296.