32
Spring 2011 6.813/6.831 User Interface Design and Implementation 1 Lecture 15: Experiment Analysis

Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Embed Size (px)

Citation preview

Page 1: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 1

Lecture 15: Experiment Analysis

Page 2: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

UI Hall of Fame or Shame?

Spring 2011 6.813/6.831 User Interface Design and Implementation 2

Page 3: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 3

Nanoquiz

• closed book, closed notes• submit before time is up (paper or web)• we’ll show a timer for the last 20 seconds

Spring 2011

Page 4: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 4

1. To maximize external validity, a good research method to use is: (choose one best answer)

A. formative evaluationB. field studyC. surveyD. lab study

2. When deciding between between-subjects and within-subjects designs for an experiment, the most important issues to consider are: (choose all best answers)

A. the setting of the experimentB. ordering effectsC. individual differencesD. tasks

3. Louis Reasoner is running a user study to compare two input devices, and he decides to let participants in his user study decide which device they’ll use in the study. This decision threatens: (choose one best answer)

A. reliabilityB. external validityC. internal validityD. experimenter bias

2019181716151413121110 9 8 7 6 5 4 3 2 1 0

Spring 2011

Page 5: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Today’s Topics

• Hypothesis testing• Graphing with error bars• T test• ANOVA test

Spring 2011 6.813/6.831 User Interface Design and Implementation 5

Page 6: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Experiment Analylsis

• Hypothesis: Mac menubar is faster to access than Windows menubar– Design: between-subjects, randomized

assignment of interface to subject

Spring 2011 6.813/6.831 User Interface Design and Implementation 6

Windows Mac625 647480 503621 559633 586

Page 7: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Statistical Testing

• Compute a statistic summarizing the experimental data

mean(Win)mean(Mac)

• Apply a statistical test– t test: are two means different?– ANOVA (ANalysis Of VAriance): are three or more

means different?• Test produces a p value

– p value = probability that the observed difference happened purely by chance

– If p < 0.05, then we are 95% confident that there is a difference between Windows and Mac

Spring 2011 6.813/6.831 User Interface Design and Implementation 7

Page 8: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Standard Error of the Mean

Spring 2011 6.813/6.831 User Interface Design and Implementation 8

0

50

100

150

200

250

300

350

400

450

500

Windows Mac

0

50

100

150

200

250

300

350

400

450

500

Windows Mac

N = 4 :Error bars overlap, so can’tconclude anything

N=10: Error bars are disjoint, soWindows may be different from Mac

Page 9: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Graphing Techniques

• Pros– Easy to compute– Give a feel for your data

• Cons– Not a substitute for statistical testing

Spring 2011 6.813/6.831 User Interface Design and Implementation 9

0

50

100

150

200

250

300

350

400

450

500

Windows Mac

max

min

25th percentile

75th percentile

median

Windows Max

Error bars Tukey box plots

Page 10: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 10

Quick Intro to R

• R is an open source programming environment for data manipulation– includes statistics & charting

• Get the data inwin = scan() # win = [625, 480, … ]

mac = scan() # mac = [647, 503, … ]• Compute with it

means = c(mean(win), mean(mac)) # means = [584, 508]

stderrs = c(sd(win)/sqrt(10), sd(mac)/sqrt(10)) # stderrs = [23.29, 26.98]• Graph it

plot = barplot(means, names.arg=c("Windows", "Mac"), ylim=c(0,800))

error.bar(plot, means, stderrs)

Spring 2011

Page 11: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 11

Hypothesis Testing

• Our hypothesis: position of menubar matters– i.e., mean(Mac times) < mean(Windows times)– This is called the alternative hypothesis (also called H1)

• If we’re wrong: position of menu bar makes no difference– i.e., mean(Mac) = mean(Win) – This is called the null hypothesis (H0)

• We can’t really disprove the null hypothesis– Instead, we argue that the chance of seeing a difference at

least as extreme as what we saw is very small if the null hypothesis is true

Page 12: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 12

Statistical Significance

• Compute a statistic from our experimental dataX = mean(Win) – mean(Mac)

• Determine the probability distribution of the statistic assuming H0 is true

Pr( X=x | H0)• Measure the probability of getting the same or greater difference

Pr ( X > x0 | H0 ) one-sided test 2 Pr ( X > |x0| | H0) two-sided test

• If that probability is less than 5%, then we say– “We reject the null hypothesis at the 5% significance level”– equivalently: “difference between menubars is statistically

significant (p < .05)”• Statistically significant does not mean scientifically important

Page 13: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 13

T test

• T test compares the means of two samples A and B

• Two-sided:– H0: mean(A) = mean(B)– H1: mean(A) <> mean(B)

• One-sided:– H0: mean(A) = mean(B)– H1: mean(A) < mean(B)

• Assumptions:– samples A & B are independent (between-

subjects, randomized)– normal distribution– equal variance

Page 14: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Running a T Test (Excel)

Spring 2011 6.813/6.831 User Interface Design and Implementation 14

Win Mac

625 647

480 503

621 559

633 586

  Windows MacMean 589.8 573.9Variance 5368.5 3574.0Observations 4 4Pooled Variance 4471.2Hypothesized Mean Difference 0df 6t Stat 0.336P(T<=t) one-tail 0.374t Critical one-tail 1.943P(T<=t) two-tail 0.748t Critical two-tail 2.447

Page 15: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Running a T Test

Spring 2011 6.813/6.831 User Interface Design and Implementation 15

Win Mac

625 647

480 503

621 559

633 586

694 458

599 380

505 477

527 409

651 589

505 472

  Windows MacMean 584.0 508.1Variance 5409.3 7295.0Observations 10 10Pooled Variance 6352.2Hypothesized Mean Difference 0df 18t Stat 2.130P(T<=t) one-tail 0.024t Critical one-tail 1.734P(T<=t) two-tail 0.047t Critical two-tail 2.101

Page 16: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 16

Running a T Test (in R)

• t.test(win, mac)

Spring 2011

Page 17: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 17

Using Factors in R

• time = c(win,mac)• menubar = factor(c(rep(“win”,10),rep(“mac”,10)))

time = [ 625, 480, …, 647, 503, …]

menubar = [ win, win, …, mac, mac, …]

• t.test(time ~ menubar)

Spring 2011

Page 18: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 18

Paired T Test

• For within-subject experiments with two conditions

• Uses the mean of the differences (each user against themselves)

• H0: mean(A_i – B_i) = 0• H1: mean(A_i – B_i) <> 0 (two-sided test)

or mean(A_i – B_i) > 0 (one-sided test)

Page 19: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Reading a Paired T Test

Spring 2011 6.813/6.831 User Interface Design and Implementation 19

Win Mac

625 647

480 503

621 559

633 586

694 458

599 380

505 477

527 409

651 589

505 472

  Windows MacMean 584.0 508.1Variance 5409.3 7295.0Observations 10 10Pearson Correlation 0.370Hypothesized Mean Difference 0df 9t Stat 2.675P(T<=t) one-tail 0.013t Critical one-tail 1.833P(T<=t) two-tail 0.025t Critical two-tail 2.262

Page 20: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 20

Running a Paired T Test (in R)

• t.test(times ~ menubar, paired=TRUE)

Spring 2011

Page 21: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 21

Analysis of Variance (ANOVA)

• Compares more than 2 means• One-way ANOVA

– 1 independent variable with k >= 2 levels– H0: all k means are equal– H1: the means are different (so the independent

variable matters)

Page 22: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Running a One-Way ANOVA (Excel)

Spring 2011 6.813/6.831 User Interface Design and Implementation 22

Win Mac Bottom625 647 485480 503 436621 559 512633 586 564694 458 560599 380 587505 477 391527 409 488651 589 555505 472 446

Groups Count Sum Average VarianceWindows 10 5839 584.0 5409.3Mac 10 5080 508.1 7295.0Bottom 10 5023 502.3 4175.8Total 30 15943 531.5 6671.6

Source of Variation SS df MS F P-value F crit

Between Groups 41555 2 20777 3.693 0.038 3.354Within Groups 151920 27 5626

Total 193476 29

Page 23: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 23

Running ANOVA (in R)

time = [ 625, 480, …, 647, 503, …, 485, 436, …]

menubar = [ win, win, …, mac, mac, …, btm, btm, …]

• fit = aov(time ~ menubar)• summary(fit)

Spring 2011

Page 24: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 24

Running Within-Subjects ANOVA (in R)

time = [ 625, 480, …, 647, 503, …, 485, 436, …]

menubar = [ win, win, …, mac, mac, …, btm, btm, …]

subject = [ u1, u1, u2, u2, …, u1, u1, u2, u2 …, u1, u1, u2, u2, …]

• fit = aov(time ~ menubar + Error(subject/menubar))• summary(fit)

Spring 2011

Page 25: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Tukey HSD Test

• Tests pairwise differences for significance after a significant ANOVA test– More stringent than multiple pairwise t tests

• Be careful in general about applying multiple statistical tests

Spring 2011 6.813/6.831 User Interface Design and Implementation 25

Windows Mac Bottom460.0

480.0

500.0

520.0

540.0

560.0

580.0

600.0

Win vs. Mac 3.201Mac vs. Bottom 0.242Win vs. Bottom 3.443critical value 3.5

Page 26: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 26

Tukey HSD Test (in R)

• TukeyHSD(fit)

Spring 2011

Page 27: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Spring 2011 6.813/6.831 User Interface Design and Implementation 27

Two-Way ANOVA

• 2 independent variables with j and k levels, respectively

• Tests whether each variable has an effect independently

• Also tests for interaction between the variables

Page 28: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

6.813/6.831 User Interface Design and Implementation 28

Two-way Within-Subjects ANOVA (in R)

time = [ 625, 480, …, 647, 503, …, 485, 436, …]

menubar = [ win, win, …, mac, mac, …, btm, btm, …]

device = [ mouse, pad, …, mouse, pad, …, mouse, pad, …]

subject = [ u1, u1, u2, u2, …, u1, u1, u2, u2 …, u1, u1, u2, u2, …]

• fit = aov(time ~ menubar*device + Error(subject/menubar*device))

• summary(fit)

Spring 2011

Page 29: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Other Tests

• Two discrete-valued variables– “does past experience affect menubar preference?”

• independent var { WinUser, MacUser} • dependent var {PrefersWinMenu, PrefersMacMenu}

– contingency table

– Fisher exact test and chi square test

• Two (or more) scalar variables– Regression

Spring 2011 6.813/6.831 User Interface Design and Implementation 29

PrefersWin PrefersMacWinUser 25 9MacUser 8 19

Page 30: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Tools for Statistical Testing

• Web calculators• Excel• Statistical packages

– Commercial: SAS, SPSS, Stata– Free: R

Spring 2011 6.813/6.831 User Interface Design and Implementation 30

Page 31: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

Summary

• Use statistical tests to establish significance of observed differences

• Graphing with error bars is cheap and easy, and great for getting a feel for data

• Use t test to compare two means, ANOVA to compare 3 or more means

Spring 2011 6.813/6.831 User Interface Design and Implementation 31

Page 32: Spring 20116.813/6.831 User Interface Design and Implementation1 Lecture 15: Experiment Analysis

32

UI Hall of Fame or Shame?

Spring 2011 6.813/6.831 User Interface Design and Implementation