Title Text - univie.ac.atvda.univie.ac.at/Teaching/HCI/18s/LectureNotes/09_Lab... · 2018-05-15 · • R. Matthews, “Storks Deliver Babies”. Journal of Teaching Statistics, vol

Evaluation: Scientific Studies

Evaluation Beyond Usability Tests

Usability Evaluation (last week)

• Expert tests / walkthroughs
• Usability tests with users

• Main goal: formative
  – identify usability problems
  – improve the tool

Summative Evaluation (focus today)

• How good is it? Useful?
• Better than other tools?

Formative and Summative: Usually combined

[Figure: evaluation over time, from formative to summative]

Evaluation goals (summative)

• Generalizability
  – Results can be applied to other people

• Precision
  – We measured what we wanted to measure (controlling for factors that we did not intend to study)

• Realism
  – Study context is realistic

... usually there is a trade-off between them!

The selection of a research method depends on the research question and the object under study! (© McGrath / Carpendale)

Controlled Experiments

Controlled experiment

• Or:
  – Laboratory Experiment
  – Lab study
  – User Study
  – A/B Testing (used in marketing)
  – …

Focus

• Precision
• Generalizability (?)

• Overall goal
  – Reveal cause-effect relationships
  – e.g. smoking causes cancer

Scenario

[Figure: two interface variants, A and B. © Carpendale]

Which is better?

Test it with users!

Hypothesis

• A precise problem statement
• Example:
  – H1 = Participants will buy more beer when using variant B than variant A
  – Null hypothesis H0 = no difference in beer purchases

Independent Variables

• Factors to be studied
• Typical independent variables (in HCI)
  – Different types of design
  – Task type: e.g., searching/browsing
  – Participant demographics: e.g., male/female
  – Different technologies: touch pad vs. keyboard

• Control of independent variables
  – Levels: the number of values (conditions) each factor takes
  – Limited by the length of the study and the number of participants
• How different?
  – Entire interfaces vs. very specific parts
  – Entire interfaces: we can’t tell what actually causes the difference (not so good for research?)
  – Very specific parts: expensive (not so good for industry?)

Control Environment

• Make sure nothing else could cause your effect

• Control confounding variables
• Randomization!

E.g., you observe in a natural setting that girls always use A and guys always use B. Do they buy more beer because of interface A or because they are girls? Without control, you can’t say anything about causality.

Different Designs: Between-Subjects

• Divide the participants into groups, each group does one condition

• Randomize: group assignment
• Potential problem?

[Figure: Group 1 uses variant A, Group 2 uses variant B]
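The random group assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_groups`, the participant IDs, and the fixed seed are assumptions made for the example, not part of the lecture.

```python
import random

def assign_groups(participants, conditions=("A", "B"), seed=None):
    """Randomly split participants into near-equal groups, one per condition."""
    rng = random.Random(seed)  # a seed only makes this sketch reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    # Deal the shuffled participants round-robin, so group sizes differ by at most one.
    for i, participant in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(participant)
    return groups

participants = [f"P{i:02d}" for i in range(1, 31)]  # hypothetical participant IDs
groups = assign_groups(participants, seed=42)
print({condition: len(members) for condition, members in groups.items()})
```

Shuffling first and then dealing round-robin keeps the groups balanced in size while leaving it to chance who ends up in which condition.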

Different Designs: Within-Subjects

• Everybody does all the conditions
• Can account for individual differences and reduce noise (that’s why it may be more powerful and require fewer participants)

• Severely limits the number of conditions, and even the types of tasks tested (may be able to work around this by having multiple sessions)

• Can lead to ordering effects → Randomize order

Common ordering effects:
* Learning effect: Did everybody use the interfaces in the same order? If so, are people faster because they are more practiced, or because of the interface?
* Fatigue / boredom
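One way to deal with the ordering effects above is sketched below. Note that it uses counterbalancing (every possible order is used equally often) rather than purely random orders. This is a common systematic alternative, swapped in here because it is easy to verify. The function name and the number of participants are assumptions for the example.

```python
from itertools import permutations

def counterbalanced_orders(conditions, n_participants):
    """Cycle through all possible condition orders so each is used equally often."""
    orders = list(permutations(conditions))
    return [orders[i % len(orders)] for i in range(n_participants)]

schedule = counterbalanced_orders(["A", "B"], n_participants=6)
for pid, order in enumerate(schedule, start=1):
    print(f"P{pid}: {' -> '.join(order)}")
```

With two conditions there are only two orders (A then B, B then A). With more conditions the number of orders grows factorially, which is one reason within-subjects designs limit the number of conditions.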

Dependent Variable

• The things that you measure
• Performance indicators:
  – task completion time, error rates, mouse movement…
  – (number of beers bought)
• Subjective participant feedback:
  – satisfaction ratings, closed-ended questions, interviews…
  – questionnaires (HCI lecture last week)
• Observations:
  – behaviors, signs of frustration…

Tasks

• Specifying good tasks for controlled experiments is tricky
  – Especially if you are measuring performance criteria
• Task criteria
  – comparability across different interfaces
  – clear end point

• Example
  – usability test: >>buy a book for a 4 year old<<
  – controlled experiment: >>find and buy the book Doctor Faustus by Thomas Mann<<

Results: Application of Statistics

• Descriptive Statistics
  – Describe the data you gathered (e.g. visually)

• Inferential Statistics
  – Make predictions/inferences from your study to the larger population

Descriptive statistics

• Central tendency
  – mean {1, 2, 4, 5} → 3
  – median {15, 19, 22, 29, 33, 45, 50} → 29
  – mode {12, 15, 22, 22, 22, 34, 34} → 22

• Measures of spread
  – range
  – variance
  – standard deviation

Note: when the standard deviation is used inferentially, N becomes (N − 1) in the denominator → an estimate for the sampled population
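The measures above can be checked with Python's standard `statistics` module. The sketch reuses the example sets from the slide; the variable names are chosen for illustration. `pstdev` divides by N (describing exactly the data at hand), while `stdev` divides by N − 1 (estimating the spread of the sampled population).

```python
import statistics

# Central tendency, using the example sets from the slide
mean_value = statistics.mean([1, 2, 4, 5])
median_value = statistics.median([15, 19, 22, 29, 33, 45, 50])
mode_value = statistics.mode([12, 15, 22, 22, 22, 34, 34])

# Measures of spread for one of the sets
data = [1, 2, 4, 5]
value_range = max(data) - min(data)
population_sd = statistics.pstdev(data)  # divides by N
sample_sd = statistics.stdev(data)       # divides by N - 1

print(mean_value, median_value, mode_value)
print(value_range, round(population_sd, 3), round(sample_sd, 3))
```

The sample standard deviation is always slightly larger than the population version for the same data, and the difference shrinks as N grows.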

Visualization of descriptive statistics

e.g., Boxplot

• Median (some variants show the mean)
• 25% / 75% quartiles
• Min / max
• (alternative: with outliers)

Boxplots are not completely standardized: whiskers, outliers, etc. can be interpreted differently depending on what you are showing.

→ Important: describe what you are visually encoding
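The values a boxplot encodes can be computed directly, as in this small sketch using the standard `statistics` module. The function name `five_number_summary` is an assumption for the example. Even the quartiles depend on a method choice (`statistics.quantiles` defaults to `method="exclusive"`), which is one more reason to state what a plot shows.

```python
import statistics

def five_number_summary(data):
    """Return min, quartiles, and max: the values a basic boxplot encodes."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {"min": min(data), "q1": q1, "median": median, "q3": q3, "max": max(data)}

summary = five_number_summary([15, 19, 22, 29, 33, 45, 50])
print(summary)
```

Passing `method="inclusive"` to `statistics.quantiles` can give different quartiles for the same data, so two tools may draw different boxes from identical measurements.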

Validity

• Errors:
  – Type I: False positives
  – Type II: False negatives

• External Validity
  – Can we generalize the study?
  – E.g. generalizable to the larger population of undergrad students
• Internal Validity
  – Is there a causal relationship?
  – Are there alternate causes?

Inferential statistics

• Goal: Generalize findings to the larger population

http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci

Excursus: Tragedy of the error bars

CI = Confidence intervals

SE = Standard Error (SD of the sampling distribution of the sample mean)

SD = Standard Deviation

Excursus: 95% Confidence intervals

• USE THEM!
• Interpretation: We can be 95% confident that the real mean lies within our confidence interval! (More precisely: if the study were repeated many times, about 95% of the intervals computed this way would contain the real mean.)

• More intuition about stats:
  – Seeing theory: http://students.brown.edu/seeing-theory/
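To make the difference between SD, SE, and a 95% CI concrete, here is a minimal sketch using only the standard library. It relies on a normal approximation (z of about 1.96), which is an assumption that is reasonable for larger samples. For small samples a t-distribution would give a somewhat wider interval. The task times are made up for illustration.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return mean, SD, SE, and a normal-approximation confidence interval."""
    n = len(sample)
    m = statistics.mean(sample)
    sd = statistics.stdev(sample)  # spread of the individual measurements
    se = sd / math.sqrt(n)         # spread of the sample mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, sd, se, (m - z * se, m + z * se)

# Made-up task completion times in seconds (illustration only)
times = [12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 12.5, 9.5, 11.0]
m, sd, se, (low, high) = mean_confidence_interval(times)
print(f"mean={m:.2f} SD={sd:.2f} SE={se:.2f} 95% CI=[{low:.2f}, {high:.2f}]")
```

The three kinds of error bars differ a lot in length: SD bars are the longest, SE bars the shortest, and 95% CI bars are roughly twice the SE in each direction. A plot therefore always needs to say which one it shows.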

Null Hypothesis Testing

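• Statistically significant results
  – p < .05
  – p: the probability of observing a difference at least this large if the null hypothesis were true
  – .05: the accepted probability of incorrectly rejecting the null hypothesis (Type I error)
• Many different tests
  – t-test, ANOVA, …

[Figure: variants A and B compared, with confidence intervals (CI)]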
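The logic of null hypothesis testing can be illustrated with an exact permutation test. This is not one of the tests named on the slide (t-test, ANOVA). It is swapped in here because it needs only the standard library and makes the meaning of p explicit: if the group labels did not matter, how often would a difference at least as large as the observed one occur? The beer counts are made up.

```python
from itertools import combinations

def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    extreme = 0
    n_splits = 0
    # Try every way of relabeling the pooled data into groups of the same sizes.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / n_splits

# Made-up numbers of beers bought per participant (illustration only)
beers_a = [2, 3, 1, 2, 4, 2]
beers_b = [4, 5, 3, 6, 4, 5]
p_value = permutation_test(beers_a, beers_b)
print(f"p = {p_value:.3f}")
```

The exact version enumerates every relabeling, which is only feasible for small groups. For larger studies one would sample random relabelings or use a parametric test such as the t-test.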

Internal Validity: Storks deliver babies!?

• R. Matthews, “Storks Deliver Babies”. Journal of Teaching Statistics, vol. 22, issue 2, pages 36-38, 2001;

• There is a correlation coefficient of r=0.62 (reasonably high)

• A statistical test can be employed that shows that this correlation is in fact significant (p = 0.008)

• What are the flaws?
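A small sketch can show how a high correlation coefficient arises without any causal link between the two measured variables. The numbers below are synthetic and are NOT the data from the paper. The hidden third variable (`region_size`) is a hypothetical confounder chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Synthetic data: both quantities depend on a hidden third variable plus noise,
# and neither influences the other.
region_size = [1, 2, 3, 4, 5, 6]
storks = [2 * s + noise for s, noise in zip(region_size, [1, -1, 0, 2, -2, 0])]
births = [5 * s + noise for s, noise in zip(region_size, [-2, 3, 1, -3, 0, 1])]

r = pearson_r(storks, births)
print(f"r = {r:.2f}")
```

The coefficient comes out strongly positive even though, by construction, storks and births are connected only through the shared third variable.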

→ Correlation does not imply causation.

Pragmatically … A step-by-step how-to

Relevant for M4! :-)

Experimental Procedure: Typical example

• Identify research hypothesis
• Specify the design of the study
• Think about statistics *before* you run the study
• Run a pilot study
• Recruit participants
• Run the actual data collection sessions
• Analyze the data
• Report the results

Run a pilot study

• … to test the study design
• … to test the system
• … to test the study instruments

Recruit participants

• Reflecting the larger population?
  – in the best case yes
  – pragmatic decision though

• How many?
  – Depends on effect size and study design (power of the experiment)
  – Usually 15+ (per group)
  – Note: much higher than for a usability test (~5)
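How the number of participants depends on effect size and power can be sketched with a common textbook approximation for comparing two independent groups: n per group ≈ 2 · ((z_{1−α/2} + z_{power}) / d)². The assumptions here are loud ones: the effect size is expressed as Cohen's d, the design is between-subjects, and the normal approximation slightly underestimates what an exact t-test calculation would give. The function name is made up for the example.

```python
import math
from statistics import NormalDist

def participants_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate group size for a two-sided comparison of two independent groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

for d in (0.5, 0.8, 1.0):
    print(f"d = {d}: about {participants_per_group(d)} participants per group")
```

Smaller expected effects need many more participants, so a rule of thumb such as "15+ per group" only holds when the expected difference between the conditions is large.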

Run the actual data collection process

• System and instruments ready?
• Greet participants
• Introduce purpose of study and procedure
  – or deliberately don’t
  – Don’t bias: “compare my interface vs. this other interface”, …

• Get consent of the participants
  – ethics!

Run the actual data collection process

• Assign participants to a specific experiment condition
  – according to a pre-defined randomization method

• Introduction to system(s) and/or training tasks
• Participants complete the actual tasks
  – take measures of dependent variables
• Participants answer questionnaire (if any)
• Debriefing session
• Payment (if any)
  – monetary, coupons, chocolate

Report the results

• Introduction / motivation
• Study design
• Results
• Discussion
• Conclusions
• References / Appendix

• See, for instance, Saul Greenberg’s recommendation:
  – http://pages.cpsc.ucalgary.ca/~saul/hci_topics/assignments/controlled_expt/ass1_reports.html

Other, more qualitative Evaluation Methods

Quantitative? Qualitative?

• Qualitative
• Focus: Meaning & experience from participants’ perspectives
• Key issue: Relevance ‘why & how’? In-depth, rich description
• Approach: more exploratory, open-ended, interpretive; e.g. interviews, observations, case studies, focus groups etc.

Qualitative Methods as “Add-on”: Mixed Methods Approach

Often controlled experiment +
• Experimenter Observations
• Collecting Participants’ Opinions
• Core methods: Observation, Semi-structured Interviewing

Helpful for...
• Usability Improvement (cf. HCI last weeks)
• New insights, explanation of unforeseen results, new questions
• Can help to confirm results

Qualitative Methods as Primary Method

• Pre-design studies
  – Rich understanding of a complex domain
  – Problems, challenges, domain language

• During-, Post-design studies
  – Case studies / Field studies

Helpful for...
• holistic understanding

Qualitative Methods as Primary Method

• In Situ Observations
• Participatory Observations
• Laboratory Observational Studies
• Contextual Interviews
• Focus Groups

Qualitative Challenges

• Sample Sizes
  – Doing intensive studies with a lot of participants?
  – Time? Data produced?

• Subjectivity
  – Social relationship?

• Analyzing the data
  – Grounded theory
  – Open and axial coding

Further Reading Material

Recommended! :)
