Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 1

Lecture 15: Experiment Analysis

UI Hall of Fame or Shame?


6.813/6.831 User Interface Design and Implementation 3

Nanoquiz

• closed book, closed notes• submit before time is up (paper or web)• we’ll show a timer for the last 20 seconds

Spring 2011


1. To maximize external validity, a good research method to use is: (choose one best answer)

A. formative evaluationB. field studyC. surveyD. lab study

2. When deciding between between-subjects and within-subjects designs for an experiment, the most important issues to consider are: (choose all best answers)

A. the setting of the experimentB. ordering effectsC. individual differencesD. tasks

3. Louis Reasoner is running a user study to compare two input devices, and he decides to let participants in his user study decide which device they’ll use in the study. This decision threatens: (choose one best answer)

A. reliabilityB. external validityC. internal validityD. experimenter bias

2019181716151413121110 9 8 7 6 5 4 3 2 1 0

Spring 2011

Today’s Topics

• Hypothesis testing• Graphing with error bars• T test• ANOVA test


Experiment Analylsis

• Hypothesis: Mac menubar is faster to access than Windows menubar– Design: between-subjects, randomized

assignment of interface to subject


Windows Mac625 647480 503621 559633 586

Statistical Testing

• Compute a statistic summarizing the experimental data

mean(Win)mean(Mac)

• Apply a statistical test– t test: are two means different?– ANOVA (ANalysis Of VAriance): are three or more

means different?• Test produces a p value

– p value = probability that the observed difference happened purely by chance

– If p < 0.05, then we are 95% confident that there is a difference between Windows and Mac


Standard Error of the Mean


0

50

100

150

200

250

300

350

400

450

500

Windows Mac

0

50

100

150

200

250

300

350

400

450

500

Windows Mac

N = 4 :Error bars overlap, so can’tconclude anything

N=10: Error bars are disjoint, soWindows may be different from Mac

Graphing Techniques

• Pros– Easy to compute– Give a feel for your data

• Cons– Not a substitute for statistical testing


0

50

100

150

200

250

300

350

400

450

500

Windows Mac

max

min

25th percentile

75th percentile

median

Windows Max

Error bars Tukey box plots


Quick Intro to R

• R is an open source programming environment for data manipulation– includes statistics & charting

• Get the data inwin = scan() # win = [625, 480, … ]

mac = scan() # mac = [647, 503, … ]• Compute with it

means = c(mean(win), mean(mac)) # means = [584, 508]

stderrs = c(sd(win)/sqrt(10), sd(mac)/sqrt(10)) # stderrs = [23.29, 26.98]• Graph it

plot = barplot(means, names.arg=c("Windows", "Mac"), ylim=c(0,800))

error.bar(plot, means, stderrs)

Spring 2011


Hypothesis Testing

• Our hypothesis: position of menubar matters– i.e., mean(Mac times) < mean(Windows times)– This is called the alternative hypothesis (also called H1)

• If we’re wrong: position of menu bar makes no difference– i.e., mean(Mac) = mean(Win) – This is called the null hypothesis (H0)

• We can’t really disprove the null hypothesis– Instead, we argue that the chance of seeing a difference at

least as extreme as what we saw is very small if the null hypothesis is true


Statistical Significance

• Compute a statistic from our experimental dataX = mean(Win) – mean(Mac)

• Determine the probability distribution of the statistic assuming H0 is true

Pr( X=x | H0)• Measure the probability of getting the same or greater difference

Pr ( X > x0 | H0 ) one-sided test 2 Pr ( X > |x0| | H0) two-sided test

• If that probability is less than 5%, then we say– “We reject the null hypothesis at the 5% significance level”– equivalently: “difference between menubars is statistically

significant (p < .05)”• Statistically significant does not mean scientifically important


T test

• T test compares the means of two samples A and B

• Two-sided:– H0: mean(A) = mean(B)– H1: mean(A) <> mean(B)

• One-sided:– H0: mean(A) = mean(B)– H1: mean(A) < mean(B)

• Assumptions:– samples A & B are independent (between-

subjects, randomized)– normal distribution– equal variance

Running a T Test (Excel)


Win Mac

625 647

480 503

621 559

633 586

Windows MacMean 589.8 573.9Variance 5368.5 3574.0Observations 4 4Pooled Variance 4471.2Hypothesized Mean Difference 0df 6t Stat 0.336P(T<=t) one-tail 0.374t Critical one-tail 1.943P(T<=t) two-tail 0.748t Critical two-tail 2.447

Running a T Test


Win Mac

625 647

480 503

621 559

633 586

694 458

599 380

505 477

527 409

651 589

505 472

Windows MacMean 584.0 508.1Variance 5409.3 7295.0Observations 10 10Pooled Variance 6352.2Hypothesized Mean Difference 0df 18t Stat 2.130P(T<=t) one-tail 0.024t Critical one-tail 1.734P(T<=t) two-tail 0.047t Critical two-tail 2.101


Running a T Test (in R)

• t.test(win, mac)

Spring 2011


Using Factors in R

• time = c(win,mac)• menubar = factor(c(rep(“win”,10),rep(“mac”,10)))

time = [ 625, 480, …, 647, 503, …]

menubar = [ win, win, …, mac, mac, …]

• t.test(time ~ menubar)

Spring 2011


Paired T Test

• For within-subject experiments with two conditions

• Uses the mean of the differences (each user against themselves)

• H0: mean(A_i – B_i) = 0• H1: mean(A_i – B_i) <> 0 (two-sided test)

or mean(A_i – B_i) > 0 (one-sided test)

Reading a Paired T Test


Win Mac

625 647

480 503

621 559

633 586

694 458

599 380

505 477

527 409

651 589

505 472

Windows MacMean 584.0 508.1Variance 5409.3 7295.0Observations 10 10Pearson Correlation 0.370Hypothesized Mean Difference 0df 9t Stat 2.675P(T<=t) one-tail 0.013t Critical one-tail 1.833P(T<=t) two-tail 0.025t Critical two-tail 2.262


Running a Paired T Test (in R)

• t.test(times ~ menubar, paired=TRUE)

Spring 2011


Analysis of Variance (ANOVA)

• Compares more than 2 means• One-way ANOVA

– 1 independent variable with k >= 2 levels– H0: all k means are equal– H1: the means are different (so the independent

variable matters)

Running a One-Way ANOVA (Excel)


Win Mac Bottom625 647 485480 503 436621 559 512633 586 564694 458 560599 380 587505 477 391527 409 488651 589 555505 472 446

Groups Count Sum Average VarianceWindows 10 5839 584.0 5409.3Mac 10 5080 508.1 7295.0Bottom 10 5023 502.3 4175.8Total 30 15943 531.5 6671.6

Source of Variation SS df MS F P-value F crit

Between Groups 41555 2 20777 3.693 0.038 3.354Within Groups 151920 27 5626

Total 193476 29


Running ANOVA (in R)

time = [ 625, 480, …, 647, 503, …, 485, 436, …]

menubar = [ win, win, …, mac, mac, …, btm, btm, …]

• fit = aov(time ~ menubar)• summary(fit)

Spring 2011


Running Within-Subjects ANOVA (in R)

time = [ 625, 480, …, 647, 503, …, 485, 436, …]


subject = [ u1, u1, u2, u2, …, u1, u1, u2, u2 …, u1, u1, u2, u2, …]

• fit = aov(time ~ menubar + Error(subject/menubar))• summary(fit)

Spring 2011

Tukey HSD Test

• Tests pairwise differences for significance after a significant ANOVA test– More stringent than multiple pairwise t tests

• Be careful in general about applying multiple statistical tests


Windows Mac Bottom460.0

480.0

500.0

520.0

540.0

560.0

580.0

600.0

Win vs. Mac 3.201Mac vs. Bottom 0.242Win vs. Bottom 3.443critical value 3.5


Tukey HSD Test (in R)

• TukeyHSD(fit)

Spring 2011


Two-Way ANOVA

• 2 independent variables with j and k levels, respectively

• Tests whether each variable has an effect independently

• Also tests for interaction between the variables


Two-way Within-Subjects ANOVA (in R)

time = [ 625, 480, …, 647, 503, …, 485, 436, …]


device = [ mouse, pad, …, mouse, pad, …, mouse, pad, …]

subject = [ u1, u1, u2, u2, …, u1, u1, u2, u2 …, u1, u1, u2, u2, …]

• fit = aov(time ~ menubar*device + Error(subject/menubar*device))

• summary(fit)

Spring 2011

Other Tests

• Two discrete-valued variables– “does past experience affect menubar preference?”

• independent var { WinUser, MacUser} • dependent var {PrefersWinMenu, PrefersMacMenu}

– contingency table

– Fisher exact test and chi square test

• Two (or more) scalar variables– Regression


PrefersWin PrefersMacWinUser 25 9MacUser 8 19

Tools for Statistical Testing

• Web calculators• Excel• Statistical packages

– Commercial: SAS, SPSS, Stata– Free: R


Summary

• Use statistical tests to establish significance of observed differences

• Graphing with error bars is cheap and easy, and great for getting a feel for data

• Use t test to compare two means, ANOVA to compare 3 or more means


32

UI Hall of Fame or Shame?

Spring 2011 6.813/6.831 User Interface Design and Implementation

Documents

Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis