Evaluation: Scientific Studies
1
Evaluation Beyond Usability Tests
2
Usability Evaluation (last week)
• Expert tests / walkthroughs
• Usability Tests with users
• Main goal: formative
– identify usability problems
– improve the tool
3
Summative Evaluation (focus today)
• How good is it? Useful?
• Better than other tools?
4
Formative and Summative: Usually combined
[Figure: evaluation over time, from formative to summative]
5
Evaluation goals (summative)
• Generalizability
– Results can be applied to other people
• Precision
– We measured what we wanted to measure (controlling for factors we did not intend to study)
• Realism
– Study context is realistic
... usually a trade-off between them!
6
The selection of a research method depends on the research question and the object under study!
© McGrath / Carpendale
7
Controlled Experiments
8
Controlled experiment
• Or:
– Laboratory Experiment
– Lab study
– User Study
– A/B Testing (used in marketing)
– …
9
Focus
• Precision
• Generalizability (?)
• Overall goal
– Reveal cause-effect relationships
– e.g. smoking causes cancer
10
Scenario
[Figure: two design variants, A and B]
Which is better?
11
© Carpendale
Test it with users!
12
Hypothesis
• A precise problem statement
• Example:
– H1 = Participants will buy more beer when using variant B than variant A
– Null-Hypothesis H0 = no difference in beer purchase
13
Independent Variables
• Factors to be studied
• Typical independent variables (in HCI)
– Different types of design
– Task type: e.g., searching/browsing
– Participant demographics: e.g., male/female
– Different technologies: touch pad vs. keyboard
• Control of Independent Variable
– Levels: the number of values (conditions) each factor takes
– Limited by the length of the study and the number of participants
• How different?
– Entire interfaces vs. very specific parts
14
How different?
entire interfaces: I can’t tell what actually causes the difference (not so good for research?)
very specific parts: expensive (not so good for industry?)
Control Environment
• Make sure nothing else could cause your effect
• Control confounding variables
• Randomization!
15
E.g., we only observe naturally and see that girls always use A and guys always use B. Do they buy more beer because of interface A, or because they are girls? We can’t say anything about causality.
Different Designs: Between-Subjects
• Divide the participants into groups, each group does one condition
• Randomize: Group Assignment
• Potential problem?
[Figure: variants A and B, each assigned to one group (Group 1, Group 2)]
16
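The random group assignment mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the participant IDs, the function name `assign_between_subjects`, and the fixed seed are assumptions made for the example.

```python
import random


def assign_between_subjects(participants, conditions=("A", "B"), seed=None):
    """Randomly assign each participant to exactly one condition.

    Participants are shuffled first and then dealt round-robin, so the
    group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}


# Hypothetical participant IDs, for illustration only.
participants = [f"P{i:02d}" for i in range(1, 31)]
assignment = assign_between_subjects(participants, seed=42)

group_sizes = {c: sum(1 for v in assignment.values() if v == c) for c in ("A", "B")}
print(group_sizes)
```

Shuffling before dealing keeps the groups balanced while leaving it to chance who ends up in which group, so that participant characteristics are not systematically tied to one variant.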
Different Designs: Within-Subjects
• Everybody does all the conditions
• Can account for individual differences and reduce noise (that’s why it may be more powerful and require fewer participants)
• Severely limits the number of conditions, and even the types of tasks tested (may be able to work around this by having multiple sessions)
• Can lead to ordering effects -> Randomize Order
17
Common ordering effects:
* Learning effect: Did everybody use the interface in a certain order? If so, are people faster because they are more practiced, or because of the effect of the interface?
* Fatigue / Boredom
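One way to keep ordering effects from favouring a variant is to balance the orders across participants. The sketch below uses full counterbalancing (every possible order is used equally often) combined with a random shuffle of who gets which order; this is one common option, not necessarily the scheme meant on the slide. Participant IDs and the function name are assumptions.

```python
import itertools
import random


def counterbalanced_orders(participants, conditions=("A", "B"), seed=None):
    """Give each participant an order of conditions.

    All permutations of the conditions are cycled through, so each order
    is used (almost) equally often; shuffling the participants first
    randomizes who receives which order.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    orders = list(itertools.permutations(conditions))
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}


# Hypothetical participant IDs, for illustration only.
participants = [f"P{i:02d}" for i in range(1, 9)]
orders = counterbalanced_orders(participants, seed=1)

n_ab = sum(1 for o in orders.values() if o == ("A", "B"))
n_ba = sum(1 for o in orders.values() if o == ("B", "A"))
print(n_ab, n_ba)
```

With many conditions the number of permutations grows quickly, which is one reason within-subjects designs limit the number of conditions.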
Dependent Variable
• The things that you measure
• Performance indicators:
– task completion time, error rates, mouse movement…
– (numbers of beers bought)
• Subjective participant feedback:
– satisfaction ratings, closed-ended questions, interviews…
– questionnaires (HCI lecture last week)
• Observations:
– behaviors, signs of frustrations…
18
Tasks
• Specifying good tasks for controlled experiments is tricky
– Specifically, if you are measuring performance criteria
• Task criteria
– comparability for different interfaces
– clear end point
• Example
– usability test: >>buy a book for a 4 year old<<
– controlled experiment: >>find and buy the book Doctor Faustus by Thomas Mann<<
19
Results: Application of Statistics
• Descriptive Statistics
– Describes the data you gathered (e.g. visually)
• Inferential Statistics
– Make predictions/inferences from your study to the larger population
20
Descriptive statistics
• Central tendency
– mean {1, 2, 4, 5}
– median {15, 19, 22, 29, 33, 45, 50}
– mode {12, 15, 22, 22, 22, 34, 34}
21
Descriptive statistics
• Central tendency
– mean {1, 2, 4, 5} = 3
– median {15, 19, 22, 29, 33, 45, 50} = 29
– mode {12, 15, 22, 22, 22, 34, 34} = 22
• Measures of spread
– range
– variance
– standard deviation
note: for the inferential (sample) standard deviation, N becomes (N-1) -> estimate for the sampled population
23
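The measures above can be reproduced with Python's standard `statistics` module. The three data sets are the ones from the slide; reusing them for the spread measures is an assumption made only to keep the sketch short.

```python
import statistics

mean_data = [1, 2, 4, 5]
median_data = [15, 19, 22, 29, 33, 45, 50]
mode_data = [12, 15, 22, 22, 22, 34, 34]

# Central tendency
mean = statistics.mean(mean_data)        # 3
median = statistics.median(median_data)  # 29
mode = statistics.mode(mode_data)        # 22

# Spread
value_range = max(median_data) - min(median_data)  # 50 - 15 = 35
population_var = statistics.pvariance(mean_data)   # divides by N     -> 2.5
sample_var = statistics.variance(mean_data)        # divides by N - 1 -> 10/3
sample_sd = statistics.stdev(mean_data)            # square root of sample_var

print(mean, median, mode, value_range, population_var, sample_var, sample_sd)
```

`pvariance`/`pstdev` describe exactly the data at hand, while `variance`/`stdev` use N - 1 and are the ones to use when the data is a sample from a larger population.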
Visualization of descriptive statistics
• Median (sometimes the mean is marked as well)
• 25/75% Quartiles
• Min / Max
• (alternative: with outliers)
e.g., Boxplot
24
Boxplots are not completely standardized: whiskers, outliers, etc. can be drawn and interpreted in different ways.
-> Important: Describe what you are visually encoding
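Because boxplot conventions differ, it helps to compute the encoded values explicitly and report the convention used. The sketch below assumes one common convention (whiskers at the most extreme points within 1.5 × IQR of the quartiles, everything beyond counted as outliers) and the "inclusive" quartile method; both are choices for this example, and the task times are made-up numbers.

```python
import statistics


def boxplot_summary(data):
    """Values encoded by a Tukey-style boxplot (1.5 * IQR whiskers)."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Made-up task completion times in seconds, for illustration only.
times = [12, 15, 17, 18, 20, 21, 23, 25, 60]
summary = boxplot_summary(times)
print(summary)
```

Other tools may place whiskers at min/max or use a different quartile definition, which is exactly why the encoding should be stated next to the plot.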
Validity
• Errors:
– Type I: False positives
– Type II: False negatives
• External Validity
– Can we generalize the study?
– E.g. generalizable to the larger population of undergrad students
• Internal Validity
– Is there a causal relationship?
– Are there alternate causes?
25
Inferential statistics
• Goal: Generalize findings to the larger population
http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci
26
Excursus: Tragedy of the error bars
CI = Confidence intervals
SE = Standard Error (SD of the sampling distribution of the sample mean)
SD = Standard Deviation
27
Excursus: 95% Confidence intervals
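• USE THEM!
• Interpretation: We can be 95% confident that the real mean lies within our confidence interval! (More precisely: about 95% of the intervals computed this way from repeated samples would contain the real mean.)
• More intuition about stats:
– Seeing theory: http://students.brown.edu/seeing-theory/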
28
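The three kinds of error bars (SD, SE, 95% CI) can be computed from the same sample, as in the sketch below. It assumes a normal approximation for the interval (critical value of about 1.96); for small samples a t-distribution critical value would give a somewhat wider interval. The data is made up for illustration.

```python
import math
import statistics


def mean_with_error_bars(sample):
    """Return mean, SD, SE and an approximate 95% confidence interval."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)               # sample SD (N - 1)
    se = sd / math.sqrt(n)                      # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 (normal approximation)
    return mean, sd, se, (mean - z * se, mean + z * se)


# Made-up number of beers bought per participant, for illustration only.
beers = [2, 3, 3, 4, 4, 4, 5, 5, 6, 4]
mean, sd, se, (ci_low, ci_high) = mean_with_error_bars(beers)
print(mean, sd, se, ci_low, ci_high)
```

SD bars describe the spread of the data, SE bars shrink as the sample grows, and CI bars are roughly twice as long as SE bars, so a plot should always say which of the three it shows.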
Null Hypothesis Testing
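• Statistically significant results
– p < .05
– p is the probability of seeing a result at least as extreme as ours if the Null-Hypothesis were true; with the .05 threshold, we incorrectly reject a true Null-Hypothesis (Type I error) in at most 5% of cases
• Many different tests
– t-test, ANOVA, …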
[Figure: variants A and B with confidence intervals (CI)]
29
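To make the logic of null hypothesis testing concrete, the sketch below runs a permutation test on the difference in means between two groups. This is a stand-in chosen because it needs only the standard library; it is not the t-test or ANOVA named on the slide, although it answers the same kind of question. The data, function name and number of resamples are assumptions.

```python
import random


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for the difference in group means.

    Under the null hypothesis the group labels are exchangeable, so we
    reshuffle the labels many times and count how often the shuffled
    difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(new_b) / len(new_b) - sum(new_a) / len(new_a)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Made-up beers bought per participant with variant A and variant B.
beers_a = [2, 3, 3, 4, 2, 3, 4, 3]
beers_b = [5, 6, 4, 6, 5, 7, 5, 6]
observed_diff, p_value = permutation_test(beers_a, beers_b)
print(observed_diff, p_value)
```

If `p_value` falls below .05, the null hypothesis of "no difference in beer purchase" would be rejected at that threshold.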
Internal Validity: Storks deliver babies!?
• R. Matthews, “Storks Deliver Babies (p = 0.008)”. Teaching Statistics, vol. 22, issue 2, pages 36-38, 2000.
• There is a correlation coefficient of r=0.62 (reasonably high)
• A statistical test can be employed that shows that this correlation is in fact significant (p = 0.008)
• What are the flaws?
30
(Reason/Solution on the next slide.)
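A correlation coefficient like the one quoted above only measures how closely two series move together. The sketch below computes Pearson's r by hand on made-up numbers (not the data from the paper) in which a hypothetical third variable, `area`, drives both series.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values: both series grow with the hypothetical 'area' variable.
area = [10, 20, 30, 40, 50]
storks = [21, 39, 62, 80, 99]
births = [33, 58, 92, 119, 151]

r = pearson_r(storks, births)
print(round(r, 3))
```

Here `storks` and `births` correlate strongly although neither influences the other in the way the numbers were constructed; both simply follow `area`.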
Pragmatically … A step-by-step how-to
31
(Last slide:) Correlation does not imply causation.
Relevant for M4! :-)
Experimental Procedure: Typical example
• Identify research hypothesis
• Specify the design of the study
• Think about statistics *before* you run the study
• Run a pilot study
• Recruit participants
• Run the actual data collection sessions
• Analyze the data
• Report the results
32
Run a pilot study
• … to test the study design
• … to test the system
• … to test the study instruments
34
Recruit participants
• Reflecting the larger population?
– in the best case yes
– pragmatic decision though
• How many?
– Depends on effect size and study design (power of the experiment)
– Usually 15+ (per group)
– Note: much higher than for usability test (~5)
35
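How the required number of participants depends on effect size and power can be sketched with the standard normal-approximation formula for comparing two group means, n per group ≈ 2 · ((z₁₋α/₂ + z_power) / d)², where d is the standardized effect size (Cohen's d). The function name and the default values for alpha and power are assumptions; an exact calculation based on the t-distribution gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def participants_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate group size for a two-sided, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = .80
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.5, 0.8, 1.0):
    print(d, participants_per_group(d))
```

Under these assumptions, only a large effect (d around 1.0) can be detected reliably with roughly 15 participants per group; smaller effects need considerably more.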
Run the actual data collection process
• System and instruments ready?
• Greet participants
• Introduce purpose of study and procedure
– or deliberately don’t
– Don’t bias: “compare my interface vs. this other interface”, …
• Get consent of the participants
– ethics!
36
Run the actual data collection process
• Assign participants to specific experiment condition
– according to pre-defined randomization method
• Introduction to system(s) and/or training tasks
• Participants complete the actual tasks
– take measures of dependent variables
• Participants answer questionnaire (if any)
• Debriefing session
• Payment (if any)
– monetary, coupons, chocolate
37
Report the results
• Introduction / motivation
• Study design
• Results
• Discussion
• Conclusions
• References / Appendix
• See, for instance, Saul Greenberg’s recommendation:
– http://pages.cpsc.ucalgary.ca/~saul/hci_topics/assignments/controlled_expt/ass1_reports.html
38
Other, more qualitative Evaluation Methods
39
Quantitative? Qualitative?
• Qualitative
– Focus: Meaning & experience from participants’ perspectives
– Key issue: Relevance ‘why & how’? In-depth, rich description
– Approach: more exploratory, open-ended, interpretive; e.g. interviews, observations, case studies, focus groups etc.
40
Qualitative Methods as “Add-on”: Mixed Methods Approach
Often controlled experiment +
• Experimenter Observations
• Collecting Participants’ Opinions
• Core methods: Observation, Semi-structured Interviewing
Helpful for...
• Usability Improvement (cf. HCI last weeks)
• New insights, explanation of unforeseen results, new questions
• Can help to confirm results
41
Qualitative Methods as Primary Method
• Pre-design studies
– Rich understanding of a complex domain
– Problems, challenges, domain language
• During-, Post-design studies
– Case studies / Field studies
Helpful for...
• holistic understanding
42
Qualitative Methods as Primary Method
• In Situ Observations
• Participatory Observations
• Laboratory Observational Studies
• Contextual Interviews
• Focus Groups
43
Qualitative Challenges
• Sample Sizes
– Doing intensive studies with a lot of participants?
– Time? Data produced?
• Subjectivity
– Social relationship?
• Analyzing the data
– Grounded theory
– Open and axial coding
44
Further Reading Material
Recommended! :)
45