Evaluation


Slides for HCICourse.com


evaluation, validation and empirical methods

Alan Dix

http://www.alandix.com/

evaluation

you’ve designed it, but is it right?

different kinds of evaluation

endless arguments: quantitative vs. qualitative

in the lab vs. in the wild

experts vs. real users (vs UG students!)

really need to combine methods

quantitative – what is true & qualitative – why

what is appropriate and possible

purpose

types of evaluation (purpose / stage)

formative – improve a design (development)

summative – say “this is good” (contractual/sales)

investigative / exploratory – gain understanding (research)

when does it end?

in a world of perpetual beta ...

real use is the ultimate evaluation

logging, bug reporting, etc.

how do people really use the product?

are some features never used?

studies and experiments

what varies (and what you choose)

individuals / groups (not only UG students!)

tasks / activities

products / systems

principles / theories

prior knowledge and experience

learning and order effects

which are you trying to find out about?
which are ‘noise’?

a little story …

BIG ACM sponsored conference

‘good’ empirical paper

looking at collaborative support for a task X

three pieces of software:
A – domain specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous

[2×2 diagram: domain specific vs. generic × synchronous vs. asynchronous, with A, B and C placed in three of the four cells]

experiment

reasonable numbers of subjects in each condition

quality measures

significant results (p < 0.05):
domain specific > generic

asynchronous > synchronous

conclusion: really want async domain specific


what’s wrong with that?

interaction effects:
the gap is interesting to study
not necessarily end up best

more important … if you blinked at the wrong moment …

NOT independent variables:
three different pieces of software
like an experiment on 3 people!
say system B was just bad ...

[2×2 diagram again: the results only show B < A and B < C – the empty asynchronous, domain-specific cell remains a ?]

what went wrong?

borrowed psych method… but method embodies assumptions

single simple cause, controlled environment

interaction needs ecologically valid experiments:
multiple causes, open situations

what to do? understand assumptions and modify

numbers and statistics

are five users enough?

one of the myths of usability!
from a study by Nielsen and Landauer (1993)

empirical work, cost–benefit analysis and averages
many assumptions: simplified model, iterative steps, ...

basic idea: decreasing returns
each extra user gives less new information

really ... it depends
for robust statistics – many, many more
for something interesting – one may be enough
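The decreasing-returns idea can be sketched with the simple discovery model behind the Nielsen–Landauer study. This is only an illustration: the per-user discovery probability L ≈ 0.31 is the average they reported, and treating every problem as equally findable is one of the model's simplifying assumptions.

```python
# Diminishing-returns model of usability testing:
# fraction of problems found by n users = 1 - (1 - L)**n,
# where L is the chance a single user reveals a given problem.
# L = 0.31 is an assumed value (the Nielsen-Landauer average).

def fraction_found(n_users, l=0.31):
    """Expected fraction of usability problems found by n_users."""
    return 1 - (1 - l) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {fraction_found(n):.0%} of problems")
```

With these assumed numbers, five users already find roughly 84% of problems, and each additional user adds less than the one before – the "decreasing returns" on the slide.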

points of comparison

measures:
average satisfaction 3.2 on a 5 point scale

time to complete task in range 13.2–27.6 seconds

good or bad?

need a point of comparison
but what?

self, similar system, created or real??

think purpose ...

what constitutes a ‘control’?
think!!

do I need statistics?

finding some problem to fix – NO

to know:
how frequently it occurs
whether most users experience it
if you’ve found most problems
– YES

statistics

need a course in itself!
experimental design

choosing right test

etc., etc., etc.

a few things ...

statistical significance

stat. sig. = likelihood of seeing the effect by chance
5% (p < 0.05) = a 1 in 20 chance
beware many tests and cherry picking!
10 tests mean about a 40% chance (1 − 0.95^10) of seeing at least one p < 0.05
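The many-tests problem can be checked with a quick sketch (an assumed illustration, not from the slides): each test alone has a 5% false-positive rate under the null hypothesis, but across k independent tests the chance of at least one spurious "hit" is 1 − 0.95^k.

```python
import random

def false_positive_chance(k, alpha=0.05):
    """Chance that at least one of k independent null tests
    comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** k

# Monte Carlo check: under the null a p-value is uniform on [0, 1],
# so drawing a uniform number per test simulates one experiment.
def simulate(k, alpha=0.05, trials=20000, seed=1):
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials

print(false_positive_chance(10))  # ~0.401
print(simulate(10))               # close to the analytic value
```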

not necessarily large effect (i.e. ≠ important)

non-significant = not proven (NOT no effect)
may simply not be sensitive enough
e.g. too few users

to show no (small) effect need other methods
find out about confidence intervals!
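As a pointer, here is what a confidence interval looks like for a sample of task times. The data are made up for illustration, and using the normal approximation (1.96) is an assumption – for samples this small a t-based interval would be wider and more appropriate.

```python
import math
import statistics

# Hypothetical task-completion times in seconds (illustrative data only).
times = [13.2, 18.4, 21.0, 16.7, 27.6, 19.9, 22.3, 15.1]

mean = statistics.mean(times)
# Standard error of the mean = sample sd / sqrt(n).
sem = statistics.stdev(times) / math.sqrt(len(times))
# 95% interval via the normal approximation (assumption; use t for small n).
low, high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"mean {mean:.1f}s, 95% CI [{low:.1f}, {high:.1f}]")
```

A wide interval is exactly the "not sensitive enough" situation above: the data are consistent both with no effect and with a sizeable one.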

statistical power

how likely effect will show up in experiment

more users means more ‘power’
2× sensitivity needs 4× the number of users

manipulate it!

more users (but usually many more)

within subject/group (‘cancels’ individual diffs.)

choice of task (particularly good/bad)

add distracter task
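The 2×/4× rule of thumb follows from the standard error of a mean scaling as 1/√n: to halve the smallest detectable effect you must quadruple the sample. A minimal sketch with assumed numbers:

```python
import math

def standard_error(sd, n):
    """Standard error of a mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 10.0  # assumed population standard deviation

print(standard_error(sd, 25))   # 2.0
print(standard_error(sd, 100))  # 1.0 -> 4x the users, 2x the sensitivity
```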

from data to knowledge

types of knowledge

descriptive – explaining what happened

predictive – saying what will happen
cause ⇒ effect
where science often ends

synthetic – working out what to do to make what you want happen
effect ⇒ cause
design and engineering

generalisation?

can we ever generalise?

every situation is unique, but ...

... to use past experience is to generalise

generalisation ≠ abstraction
cases, descriptive frameworks, etc.

data ≠ generalisation
interpolation – maybe
extrapolation??

generalisation ...

never comes (solely) from data

always comes from the head

requires understanding

mechanism

reduction / reconstruction
– formal hypothesis testing
+ may be qualitative too
– more scientific precision

wholistic analytic
– field studies, ethnographies
+ ‘end to end’ experiments
– more ecological validity

from evaluation to validation

validating work

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

[diagram: your work → evaluation (experiments, user studies, peer review); but your work is a singularity – different people, different situations → sampling]

generative artefacts


toolkits, devices, interfaces, guidelines, methodologies

evaluation faces the singularity of people and situations, plus ... different designers, different briefs – too many to sample

(pure) evaluation of generative artefacts is methodologically unsound

validating work

[diagram: your work → evaluation (experiments, user studies, peer review) + justification (expert opinion, previous research, new experiments)]

justification vs. validation

• different disciplines
  – mathematics: proof = justification
  – medicine: drug trials = evaluation

• combine them:
  – look for weakness in justification
  – focus evaluation there

example – scroll arrows ...

Xerox STAR – first commercial GUI
precursor of Mac, Windows, ...

principled design decisions

which direction for scroll arrows?
not obvious: moving document or handle?

=> do a user study!

gap in justification => evaluation

unfortunately ...

Apple got the wrong design!
