Upload
lynn-hunt
View
223
Download
4
Tags:
Embed Size (px)
Citation preview
Spring 2011 6.813/6.831 User Interface Design and Implementation 1
Lecture 15: Experiment Analysis
UI Hall of Fame or Shame?
Spring 2011 6.813/6.831 User Interface Design and Implementation 2
6.813/6.831 User Interface Design and Implementation 3
Nanoquiz
• closed book, closed notes• submit before time is up (paper or web)• we’ll show a timer for the last 20 seconds
Spring 2011
6.813/6.831 User Interface Design and Implementation 4
1. To maximize external validity, a good research method to use is: (choose one best answer)
A. formative evaluationB. field studyC. surveyD. lab study
2. When deciding between between-subjects and within-subjects designs for an experiment, the most important issues to consider are: (choose all best answers)
A. the setting of the experimentB. ordering effectsC. individual differencesD. tasks
3. Louis Reasoner is running a user study to compare two input devices, and he decides to let participants in his user study decide which device they’ll use in the study. This decision threatens: (choose one best answer)
A. reliabilityB. external validityC. internal validityD. experimenter bias
2019181716151413121110 9 8 7 6 5 4 3 2 1 0
Spring 2011
Today’s Topics
• Hypothesis testing• Graphing with error bars• T test• ANOVA test
Spring 2011 6.813/6.831 User Interface Design and Implementation 5
Experiment Analylsis
• Hypothesis: Mac menubar is faster to access than Windows menubar– Design: between-subjects, randomized
assignment of interface to subject
Spring 2011 6.813/6.831 User Interface Design and Implementation 6
Windows Mac625 647480 503621 559633 586
Statistical Testing
• Compute a statistic summarizing the experimental data
mean(Win)mean(Mac)
• Apply a statistical test– t test: are two means different?– ANOVA (ANalysis Of VAriance): are three or more
means different?• Test produces a p value
– p value = probability that the observed difference happened purely by chance
– If p < 0.05, then we are 95% confident that there is a difference between Windows and Mac
Spring 2011 6.813/6.831 User Interface Design and Implementation 7
Standard Error of the Mean
Spring 2011 6.813/6.831 User Interface Design and Implementation 8
0
50
100
150
200
250
300
350
400
450
500
Windows Mac
0
50
100
150
200
250
300
350
400
450
500
Windows Mac
N = 4 :Error bars overlap, so can’tconclude anything
N=10: Error bars are disjoint, soWindows may be different from Mac
Graphing Techniques
• Pros– Easy to compute– Give a feel for your data
• Cons– Not a substitute for statistical testing
Spring 2011 6.813/6.831 User Interface Design and Implementation 9
0
50
100
150
200
250
300
350
400
450
500
Windows Mac
max
min
25th percentile
75th percentile
median
Windows Max
Error bars Tukey box plots
6.813/6.831 User Interface Design and Implementation 10
Quick Intro to R
• R is an open source programming environment for data manipulation– includes statistics & charting
• Get the data inwin = scan() # win = [625, 480, … ]
mac = scan() # mac = [647, 503, … ]• Compute with it
means = c(mean(win), mean(mac)) # means = [584, 508]
stderrs = c(sd(win)/sqrt(10), sd(mac)/sqrt(10)) # stderrs = [23.29, 26.98]• Graph it
plot = barplot(means, names.arg=c("Windows", "Mac"), ylim=c(0,800))
error.bar(plot, means, stderrs)
Spring 2011
Spring 2011 6.813/6.831 User Interface Design and Implementation 11
Hypothesis Testing
• Our hypothesis: position of menubar matters– i.e., mean(Mac times) < mean(Windows times)– This is called the alternative hypothesis (also called H1)
• If we’re wrong: position of menu bar makes no difference– i.e., mean(Mac) = mean(Win) – This is called the null hypothesis (H0)
• We can’t really disprove the null hypothesis– Instead, we argue that the chance of seeing a difference at
least as extreme as what we saw is very small if the null hypothesis is true
Spring 2011 6.813/6.831 User Interface Design and Implementation 12
Statistical Significance
• Compute a statistic from our experimental dataX = mean(Win) – mean(Mac)
• Determine the probability distribution of the statistic assuming H0 is true
Pr( X=x | H0)• Measure the probability of getting the same or greater difference
Pr ( X > x0 | H0 ) one-sided test 2 Pr ( X > |x0| | H0) two-sided test
• If that probability is less than 5%, then we say– “We reject the null hypothesis at the 5% significance level”– equivalently: “difference between menubars is statistically
significant (p < .05)”• Statistically significant does not mean scientifically important
Spring 2011 6.813/6.831 User Interface Design and Implementation 13
T test
• T test compares the means of two samples A and B
• Two-sided:– H0: mean(A) = mean(B)– H1: mean(A) <> mean(B)
• One-sided:– H0: mean(A) = mean(B)– H1: mean(A) < mean(B)
• Assumptions:– samples A & B are independent (between-
subjects, randomized)– normal distribution– equal variance
Running a T Test (Excel)
Spring 2011 6.813/6.831 User Interface Design and Implementation 14
Win Mac
625 647
480 503
621 559
633 586
Windows MacMean 589.8 573.9Variance 5368.5 3574.0Observations 4 4Pooled Variance 4471.2Hypothesized Mean Difference 0df 6t Stat 0.336P(T<=t) one-tail 0.374t Critical one-tail 1.943P(T<=t) two-tail 0.748t Critical two-tail 2.447
Running a T Test
Spring 2011 6.813/6.831 User Interface Design and Implementation 15
Win Mac
625 647
480 503
621 559
633 586
694 458
599 380
505 477
527 409
651 589
505 472
Windows MacMean 584.0 508.1Variance 5409.3 7295.0Observations 10 10Pooled Variance 6352.2Hypothesized Mean Difference 0df 18t Stat 2.130P(T<=t) one-tail 0.024t Critical one-tail 1.734P(T<=t) two-tail 0.047t Critical two-tail 2.101
6.813/6.831 User Interface Design and Implementation 16
Running a T Test (in R)
• t.test(win, mac)
Spring 2011
6.813/6.831 User Interface Design and Implementation 17
Using Factors in R
• time = c(win,mac)• menubar = factor(c(rep(“win”,10),rep(“mac”,10)))
time = [ 625, 480, …, 647, 503, …]
menubar = [ win, win, …, mac, mac, …]
• t.test(time ~ menubar)
Spring 2011
Spring 2011 6.813/6.831 User Interface Design and Implementation 18
Paired T Test
• For within-subject experiments with two conditions
• Uses the mean of the differences (each user against themselves)
• H0: mean(A_i – B_i) = 0• H1: mean(A_i – B_i) <> 0 (two-sided test)
or mean(A_i – B_i) > 0 (one-sided test)
Reading a Paired T Test
Spring 2011 6.813/6.831 User Interface Design and Implementation 19
Win Mac
625 647
480 503
621 559
633 586
694 458
599 380
505 477
527 409
651 589
505 472
Windows MacMean 584.0 508.1Variance 5409.3 7295.0Observations 10 10Pearson Correlation 0.370Hypothesized Mean Difference 0df 9t Stat 2.675P(T<=t) one-tail 0.013t Critical one-tail 1.833P(T<=t) two-tail 0.025t Critical two-tail 2.262
6.813/6.831 User Interface Design and Implementation 20
Running a Paired T Test (in R)
• t.test(times ~ menubar, paired=TRUE)
Spring 2011
Spring 2011 6.813/6.831 User Interface Design and Implementation 21
Analysis of Variance (ANOVA)
• Compares more than 2 means• One-way ANOVA
– 1 independent variable with k >= 2 levels– H0: all k means are equal– H1: the means are different (so the independent
variable matters)
Running a One-Way ANOVA (Excel)
Spring 2011 6.813/6.831 User Interface Design and Implementation 22
Win Mac Bottom625 647 485480 503 436621 559 512633 586 564694 458 560599 380 587505 477 391527 409 488651 589 555505 472 446
Groups Count Sum Average VarianceWindows 10 5839 584.0 5409.3Mac 10 5080 508.1 7295.0Bottom 10 5023 502.3 4175.8Total 30 15943 531.5 6671.6
Source of Variation SS df MS F P-value F crit
Between Groups 41555 2 20777 3.693 0.038 3.354Within Groups 151920 27 5626
Total 193476 29
6.813/6.831 User Interface Design and Implementation 23
Running ANOVA (in R)
time = [ 625, 480, …, 647, 503, …, 485, 436, …]
menubar = [ win, win, …, mac, mac, …, btm, btm, …]
• fit = aov(time ~ menubar)• summary(fit)
Spring 2011
6.813/6.831 User Interface Design and Implementation 24
Running Within-Subjects ANOVA (in R)
time = [ 625, 480, …, 647, 503, …, 485, 436, …]
menubar = [ win, win, …, mac, mac, …, btm, btm, …]
subject = [ u1, u1, u2, u2, …, u1, u1, u2, u2 …, u1, u1, u2, u2, …]
• fit = aov(time ~ menubar + Error(subject/menubar))• summary(fit)
Spring 2011
Tukey HSD Test
• Tests pairwise differences for significance after a significant ANOVA test– More stringent than multiple pairwise t tests
• Be careful in general about applying multiple statistical tests
Spring 2011 6.813/6.831 User Interface Design and Implementation 25
Windows Mac Bottom460.0
480.0
500.0
520.0
540.0
560.0
580.0
600.0
Win vs. Mac 3.201Mac vs. Bottom 0.242Win vs. Bottom 3.443critical value 3.5
6.813/6.831 User Interface Design and Implementation 26
Tukey HSD Test (in R)
• TukeyHSD(fit)
Spring 2011
Spring 2011 6.813/6.831 User Interface Design and Implementation 27
Two-Way ANOVA
• 2 independent variables with j and k levels, respectively
• Tests whether each variable has an effect independently
• Also tests for interaction between the variables
6.813/6.831 User Interface Design and Implementation 28
Two-way Within-Subjects ANOVA (in R)
time = [ 625, 480, …, 647, 503, …, 485, 436, …]
menubar = [ win, win, …, mac, mac, …, btm, btm, …]
device = [ mouse, pad, …, mouse, pad, …, mouse, pad, …]
subject = [ u1, u1, u2, u2, …, u1, u1, u2, u2 …, u1, u1, u2, u2, …]
• fit = aov(time ~ menubar*device + Error(subject/menubar*device))
• summary(fit)
Spring 2011
Other Tests
• Two discrete-valued variables– “does past experience affect menubar preference?”
• independent var { WinUser, MacUser} • dependent var {PrefersWinMenu, PrefersMacMenu}
– contingency table
– Fisher exact test and chi square test
• Two (or more) scalar variables– Regression
Spring 2011 6.813/6.831 User Interface Design and Implementation 29
PrefersWin PrefersMacWinUser 25 9MacUser 8 19
Tools for Statistical Testing
• Web calculators• Excel• Statistical packages
– Commercial: SAS, SPSS, Stata– Free: R
Spring 2011 6.813/6.831 User Interface Design and Implementation 30
Summary
• Use statistical tests to establish significance of observed differences
• Graphing with error bars is cheap and easy, and great for getting a feel for data
• Use t test to compare two means, ANOVA to compare 3 or more means
Spring 2011 6.813/6.831 User Interface Design and Implementation 31
32
UI Hall of Fame or Shame?
Spring 2011 6.813/6.831 User Interface Design and Implementation