Evaluation and metrics: Measuring the effectiveness of virtual environments
Doug Bowman, edited by C. Song

Page 1

Evaluation and metrics: Measuring the effectiveness of virtual environments

Doug Bowman

Edited by C. Song

Page 2

11.2.2 Types of evaluation

Cognitive walkthrough

Heuristic evaluation

Formative evaluation: observational user studies; questionnaires, interviews

Summative evaluation: task-based usability evaluation; formal experimentation

Sequential evaluation

Testbed evaluation

Page 3

11.5 Classifying evaluation techniques

Evaluation techniques classified along three dimensions: user involvement (requires users vs. does not require users), context of evaluation (generic vs. application-specific), and type of results (quantitative vs. qualitative):

Generic context, requires users:
Quantitative: formal summative evaluation; post-hoc questionnaire
Qualitative: informal summative evaluation; post-hoc questionnaire

Generic context, does not require users:
Quantitative: (generic performance models for VEs, e.g., Fitts' law)
Qualitative: heuristic evaluation

Application-specific context, requires users:
Quantitative: formative evaluation; formal summative evaluation; post-hoc questionnaire
Qualitative: formative evaluation (informal and formal); post-hoc questionnaire; interview/demo

Application-specific context, does not require users:
Quantitative: (application-specific performance models for VEs, e.g., GOMS)
Qualitative: heuristic evaluation; cognitive walkthrough

Page 4

11.4 How VE evaluation is different

Physical issues: the user can't see the real world in an HMD; think-aloud and speech input are incompatible

Evaluator issues: the evaluator can break presence; multiple evaluators are usually needed

Page 5

11.4 How VE evaluation is different (cont.)

User issues: very few expert users; evaluations must include rest breaks to avoid possible sickness

Evaluation type issues: lack of heuristics/guidelines; choosing independent variables is difficult

Page 6

11.4 How VE evaluation is different (cont.)

Miscellaneous issues: evaluations must focus on lower-level entities (interaction techniques, ITs) because of the lack of standards; results are difficult to generalize because of differences between VE systems

Page 7

11.6.1 Testbed evaluation framework

Main independent variables: ITs

Other considerations (independent variables): task (e.g., target known vs. target unknown); environment (e.g., number of obstacles); system (e.g., use of collision detection); user (e.g., VE experience)

Performance metrics (dependent variables): speed, accuracy, user comfort, spatial awareness…

Generic evaluation context
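
As a concrete (and purely illustrative) sketch of this design space, the snippet below crosses a set of ITs with two levels of each outside factor to enumerate the cells of a full factorial testbed experiment. All technique names and factor levels are hypothetical, not Bowman's actual testbed:

```python
# Sketch of a testbed's factorial design: cross the main independent
# variable (interaction technique) with the outside factors listed above.
# All names and levels are illustrative placeholders.
from itertools import product

techniques = ["Go-Go", "ray-casting", "HOMER"]   # ITs under study
factors = {
    "task":        ["target known", "target unknown"],
    "environment": ["no obstacles", "many obstacles"],
    "system":      ["collision detection on", "collision detection off"],
    "user":        ["VE novice", "VE expert"],
}

# Each condition is one cell of the full factorial design.
conditions = [
    dict(zip(["technique", *factors], combo))
    for combo in product(techniques, *factors.values())
]

print(len(conditions))   # 3 techniques x 2*2*2*2 factor levels = 48 cells
```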

Page 8

Testbed evaluation

[Figure: the testbed evaluation approach]
(1) Initial evaluation → (2) Taxonomy → (3) Outside factors (task, users, environment, system) → (4) Performance metrics → (5) Testbed evaluation → (6) Quantitative performance results → (7) Heuristics & guidelines → (8) User-centered application

Page 9

Taxonomy

Establish a taxonomy of interaction techniques for the interaction task being evaluated.

Example task: changing an object's color. Three subtasks:

Selecting the object, choosing a color, applying the color

Two possible technique components (TCs) for choosing a color: changing the values of the R, G, and B sliders; touching a point within a 3D color space
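
A minimal sketch of how such a taxonomy can be written down and enumerated. The two color-choice TCs are from the example above; the TC names for the other subtasks are invented for illustration:

```python
# Taxonomy of the color-change task: each subtask maps to its candidate
# technique components (TCs). A complete interaction technique is one
# TC chosen per subtask.
from itertools import product

taxonomy = {
    "select object": ["ray-casting", "occlusion selection"],  # illustrative
    "choose color":  ["RGB sliders", "3D color-space touch"], # from the slide
    "apply color":   ["gesture", "menu command"],             # illustrative
}

complete_techniques = [dict(zip(taxonomy, combo))
                       for combo in product(*taxonomy.values())]
print(len(complete_techniques))   # 2 * 2 * 2 = 8 distinct techniques
```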

Page 10

Outside Factors

A user’s performance on an interaction task may depend on a variety of factors.

Four categories:

Task: distance to be traveled, size of the object to be manipulated

Environment: number of obstacles, level of activity or motion

User: spatial awareness, physical attributes (arm length, etc.)

System: lighting model, mean frame rate, etc.

Page 11

Performance Metrics

Information about human performance

Speed, accuracy: quantitative metrics

More subjective performance values: ease of use, ease of learning, and user comfort

These concern the user's senses and body: user-centric performance measures
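
As a sketch, with a hypothetical trial log and illustrative field names, the two quantitative metrics are simple statistics over recorded trials:

```python
# Trial records with illustrative field names and synthetic values.
trials = [
    {"time_s": 4.2, "correct": True},
    {"time_s": 5.8, "correct": False},
    {"time_s": 3.9, "correct": True},
]

# Speed: mean task completion time; accuracy: fraction of correct trials.
speed = sum(t["time_s"] for t in trials) / len(trials)
accuracy = sum(t["correct"] for t in trials) / len(trials)
print(f"mean time {speed:.1f} s, accuracy {accuracy:.0%}")
```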

Page 12

Testbed Evaluation

Final stage in the evaluation of interaction techniques for 3D interaction tasks

Generic, generalizable, and reusable evaluation through the creation of testbeds

Testbeds are environments and tasks that involve all important aspects of a task, evaluate each component of a technique, consider outside influences on performance, and have multiple performance measures

Page 13

Application and Generalization of Results

Testbed evaluation produces models that characterize the usability of an interaction technique for the specified task. Usability is given in terms of multiple performance metrics with respect to various levels of outside factors, forming a performance database (DB). More information is added to the DB each time a new technique is run through the testbed.

To choose interaction techniques for an application appropriately, one must understand the interaction requirements of the application. The performance results from testbed evaluation can then be used to recommend interaction techniques that meet those requirements.
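
A minimal sketch of that selection step, with a hypothetical schema and made-up numbers (a real database would span many more metrics and factor levels):

```python
# Sketch: querying a testbed performance database to recommend techniques.
# Schema, names, and numbers are hypothetical.
performance_db = [
    {"technique": "Go-Go",       "task": "target known",   "speed_s": 2.1, "comfort": 4.0},
    {"technique": "ray-casting", "task": "target known",   "speed_s": 1.6, "comfort": 4.5},
    {"technique": "Go-Go",       "task": "target unknown", "speed_s": 3.0, "comfort": 4.0},
]

def recommend(task, min_comfort):
    """Techniques meeting the application's requirements, fastest first."""
    rows = [r for r in performance_db
            if r["task"] == task and r["comfort"] >= min_comfort]
    return [r["technique"] for r in sorted(rows, key=lambda r: r["speed_s"])]

# An application with known targets that needs comfortable interaction:
print(recommend("target known", min_comfort=4.0))   # ['ray-casting', 'Go-Go']
```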

Page 14

11.6.2 Sequential evaluation

Traditional usability engineering methods

Iterative design/eval.

Relies on scenarios, guidelines

Application-centric

[Figure: the sequential evaluation approach]
(1) User task analysis produces (A) task descriptions, sequences, and dependencies
(2) Heuristic evaluation applies (B) guidelines and heuristics to produce (C) streamlined user interface designs
(3) Formative user-centered evaluation uses (D) representative user task scenarios to produce (E) iteratively refined user interface designs
(4) Summative comparative evaluation yields the user-centered application

Page 15

11.3 When is a VE effective?

Users’ goals are realized

User tasks done better, easier, or faster

Users are not frustrated

Users are not uncomfortable

Page 16

11.3 How can we measure effectiveness?

System performance

Interface performance / User preference

User (task) performance

All are interrelated

Page 17

Effectiveness case studies

Watson experiment: how system performance affects task performance

Slater experiments: how presence is affected

Design education: task effectiveness

Page 18

11.3.1 System performance metrics

Avg. frame rate (fps)

Avg. latency / lag (msec)

Variability in frame rate / lag

Network delay

Distortion
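
Most of these can be computed directly from per-frame timestamps. A minimal sketch with synthetic timestamps, using only Python's standard library:

```python
# Sketch: average frame rate and frame-time variability from timestamps.
# The timestamps (seconds) are synthetic.
import statistics

frame_times = [0.000, 0.016, 0.033, 0.051, 0.100, 0.116]        # frame start times
deltas = [b - a for a, b in zip(frame_times, frame_times[1:])]  # per-frame intervals

avg_fps     = 1.0 / statistics.mean(deltas)      # average frame rate (fps)
variability = statistics.stdev(deltas) * 1000.0  # frame-time jitter (ms)

print(f"{avg_fps:.1f} fps, frame-time std dev {variability:.1f} ms")
```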

Page 19

System performance

Only important for its effects on user performance and preference: frame rate affects presence; network delay affects collaboration

Necessary, but not sufficient

Page 20

Case studies - Watson

How does system performance affect task performance?

Vary avg. frame rate, variability in frame rate

Measure performance on closed-loop and open-loop tasks

e.g. B. Watson et al, Effects of variation in system responsiveness on user performance in virtual environments. Human Factors, 40(3), 403-414.

Page 21

11.3.3 User preference metrics

Ease of use / learning

Presence

User comfort

Usually subjective (measured in questionnaires, interviews)

Page 22

User preference in the interface

UI goals: ease of use, ease of learning, affordances, unobtrusiveness, etc.

Achieving these goals leads to usability

Crucial for effective applications

Page 23

Case studies - Slater

Presence measured via questionnaires

assumes that presence is required for some applications

e.g. M. Slater et al, Taking Steps: The influence of a walking metaphor on presence in virtual reality. ACM TOCHI, 2(3), 201-219.

Study the effect of: collision detection, physical walking, virtual body, shadows, movement

Page 24

User comfort

Simulator sickness

Aftereffects of VE exposure

Arm/hand strain

Eye strain

Page 25

Measuring user comfort

Rating scales

Questionnaires: Kennedy's Simulator Sickness Questionnaire (SSQ)

Objective measures: Stanney's measurement of aftereffects
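
As a sketch of rating-scale scoring in the style of the SSQ, with illustrative items and placeholder weights (deliberately not Kennedy's published constants), symptom ratings are grouped into subscales and weighted into a total score:

```python
# Sketch of SSQ-style scoring: symptom ratings (0-3) grouped into
# subscales and combined with weights. Items and weights here are
# illustrative placeholders, not the published SSQ constants.
ratings = {"nausea": 1, "dizziness": 2, "eye strain": 1, "headache": 0}

subscales = {
    "nausea":     ["nausea", "dizziness"],
    "oculomotor": ["eye strain", "headache"],
}
weights = {"nausea": 1.0, "oculomotor": 1.0}   # placeholder weights

scores = {name: weights[name] * sum(ratings[i] for i in items)
          for name, items in subscales.items()}
total = sum(scores.values())

print(scores, total)   # {'nausea': 3.0, 'oculomotor': 1.0} 4.0
```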

Page 26

11.3.2 Task performance metrics

Speed / efficiency

Accuracy

Domain-specific metrics:
Education: learning
Training: spatial awareness
Design: expressiveness

Page 27

Speed-accuracy tradeoff

Subjects will make their own speed vs. accuracy decision

Must explicitly look at particular points on the curve

Manage the tradeoff

[Figure: speed-accuracy tradeoff curve (accuracy vs. speed)]

Page 28

Case studies: learning

Measure learning effectiveness against a control group

Metric: standard test

Issue: time on task not the same for all groups

e.g. D. Bowman et al. The educational value of an information-rich virtual environment. Presence: Teleoperators and Virtual Environments, 8(3), June 1999, 317-331.

Page 29

Aspects of performance

[Figure: system performance, interface performance, and task performance together determine effectiveness]

Page 30

11.7 Guidelines for 3D UI evaluation

Begin with informal evaluation

Acknowledge and plan for the differences between traditional UI and 3D UI evaluation

Choose an evaluation approach that meets your requirements

Use a wide range of metrics – not just speed of task completion

Page 31

Guidelines for formal experiments

Design experiments with general applicability: generic tasks, generic performance metrics, easy mappings to applications

Use pilot studies to determine which variables should be tested in the main experiment

Look for interactions between variables – rarely will a single technique be the best in all situations
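
For the last point, a sketch of how an interaction between variables could be tested: a two-way ANOVA over a synthetic technique x task design, using the standard statsmodels API (the data and factor levels are invented):

```python
# Sketch: testing for a technique x task interaction with a two-way ANOVA.
# Data are synthetic; the library calls are standard pandas/statsmodels.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "technique": ["ray", "ray", "gogo", "gogo"] * 6,
    "task":      ["near", "far"] * 12,
    "time_s":    [1.2, 3.9, 1.8, 2.0, 1.4, 4.1, 1.7, 2.2] * 3,  # synthetic times
})

# Fit main effects plus interaction; the C(technique):C(task) row of the
# ANOVA table is the interaction term.
model = smf.ols("time_s ~ C(technique) * C(task)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

Here ray-casting is fast for near targets but slow for far ones while Go-Go is consistent, so neither technique is best everywhere; that is exactly the pattern the interaction term detects.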

Page 32

Acknowledgments

Deborah Hix

Joseph Gabbard