Designing Experiments 2019.04.04 Juho Kim

i2r fall2019 lecture10 ExperimentDesign 20190404 · 2019-04-04 · leading people to experience the same emotions without their awareness.” Kramer et al., PNAS vol. 111 no. 24 A


What makes good research?

✘ Research is creation and communication of knowledge that is
  ○ Novel
  ○ Generalizable
  ○ Valuable
  ○ Valid

Definition from Krzysztof Gajos


today’s focus


Why run evaluation with humans?

✘ More computing technology is designed to directly help humans.
✘ Examples
  ○ HCI: GUI, mobile, input, output, web, social, AR, VR
  ○ Graphics: video / photo editing tools
  ○ SE: technology for supporting programming tasks
  ○ CV: aesthetics, scene detection, video summarization
  ○ NLP: translation quality evaluation, framing detection
  ○ Networking/Mobile: sensing technology, quality of service
  ○ …


Research Methods with Humans Involved


Controlled Experiments


Showing the success and convincing others

✘ “Do you like my system?”

✘ “How much do you like my system?”

✘ “This is an awesome system. Agree or disagree?”

✘ Huge individual variance, “please the experimenter” bias

Adapted from Scott Klemmer’s slides


Toward more systematic evaluation

✘ What are the core tasks you’re trying to support?
  ○ Back to an earlier stage of research
✘ How can you measure the performance of these tasks?
  ○ Dimensions crucial to your system (e.g., learnability – time to completion, efficiency – number of steps used, safety – accuracy)
✘ What’s the comparison?
  ○ Useful to have a baseline (control condition without your magic)

Adapted from Scott Klemmer’s slides


Toward more systematic evaluation

✘ Base rates: How often does Y occur?
  ○ Requires measuring Y.
✘ Correlations: Do X and Y co-vary?
  ○ Requires measuring X and Y.
✘ Causes: Does X cause Y?
  ○ Requires measuring X and Y, and manipulating X.
  ○ Also requires somehow accounting for the effects of other independent variables (confounds)!

Adapted from Scott Klemmer’s slides
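
The first two questions on this slide can be answered from logged observations alone. Here is a minimal Python sketch, assuming a hypothetical log in which X (the user saw a tutorial) and Y (the user finished the task) are coded as 0/1. The data and the function names `base_rate` and `pearson_r` are illustrative and not from the lecture.

```python
from math import sqrt

def base_rate(ys):
    """Fraction of observations in which Y occurred (Y coded as 0 or 1)."""
    return sum(ys) / len(ys)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical observational log (made-up values for illustration).
saw_tutorial = [1, 1, 1, 1, 0, 0, 0, 0]   # X
finished     = [1, 1, 1, 0, 1, 0, 0, 0]   # Y

print(base_rate(finished))                # 0.5
print(pearson_r(saw_tutorial, finished))  # 0.5
```

A positive correlation in such a log does not show that the tutorial caused task completion: nobody assigned the tutorial, so users who chose to see it may differ in other ways (a confound). Answering the causal question requires manipulating X.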


Is X better than Y?

✘ Most of the time, the answer is “it depends.”

✘ Under what conditions is X better than Y?

✘ Controlled comparison enables causal inference.

Adapted from Scott Klemmer’s slides


Experiment Design 101: Remember Science Class?

✘ Hypothesis: testable, quantifiable, and measurable
✘ Independent variable: what is manipulated to test the hypothesis?
  ○ Different systems, user classes, tasks
✘ Dependent variable: what is measured?
  ○ Time, errors, accuracy, # tasks done, satisfaction
  ○ Statistical tests to accept/reject the hypothesis
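
As one illustration of the last bullet, the sketch below runs a two-sided permutation test on a dependent variable (task completion time) across two levels of an independent variable (baseline menu vs. new menu). The permutation test is one possible choice of statistical test, not necessarily the one intended in the lecture. The timing data, the helper names, and the number of resamples are all assumptions made for this example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value: how often a random relabeling of the
    pooled data gives a mean difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)

# Hypothetical completion times in seconds (made-up values).
baseline_times = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
new_menu_times = [9.8, 10.4, 9.1, 10.9, 9.5, 10.2]

p_value = permutation_test(baseline_times, new_menu_times)
print(f"estimated p = {p_value:.4f}")
```

A small p-value is evidence against the null hypothesis that the two conditions do not differ. Strictly speaking, the test rejects or fails to reject that null hypothesis rather than proving the research hypothesis.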


Some Terms


Factors to Consider

✘ Users
  ○ How to sample them? Is posting on Ara sufficient?
✘ Implementation
  ○ Real vs Controlled
✘ Tasks
  ○ Realistic vs Artificial
✘ Measurement
  ○ Capturing dependent variables
✘ Ordering
  ○ Tasks and conditions
✘ Hardware
  ○ What particular devices/machines?


Let’s Practice: Mac vs Windows menu bar speed comparison

✘ Hypothesis?

✘ Independent variable?

✘ Dependent variable?

✘ Users

✘ Implementation

✘ Tasks

✘ Measurement

✘ Ordering

✘ Hardware


Threats to Validity


Concerns Driving Experiment Design

✘ Internal validity: precision
  ○ Are observed results actually caused by the IV?
✘ External validity: generalizability
  ○ Can observed results be generalized to the world outside the lab?
✘ Reliability: consistency
  ○ Will consistent results be obtained by repeating the experiment?


Threats to Internal Validity

✘ Ordering effects
  ○ Always A before B
✘ Selection bias
  ○ CS majors to use my menubar, non-majors to use baseline
✘ Experimenter bias
  ○ “My interface” vs “Some baseline I should beat”
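
Selection bias of the kind shown above (majors in one condition, non-majors in the other) is commonly addressed by assigning participants to conditions at random. Below is a minimal sketch, assuming hypothetical participant IDs and the condition names `my_menubar` and `baseline`. The function name `randomly_assign` is illustrative.

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them round-robin into conditions.

    Shuffling removes any link between who a participant is and which
    condition they get; round-robin dealing keeps group sizes balanced.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, participant in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(participant)
    return groups

# Hypothetical participant pool of 12 people.
participants = [f"P{i:02d}" for i in range(1, 13)]
groups = randomly_assign(participants, ["my_menubar", "baseline"], seed=42)
for condition, members in groups.items():
    print(condition, members)
```

Random assignment does not address experimenter bias. That usually calls for separate measures such as scripted instructions or keeping the experimenter unaware of the condition.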


Threats to External Validity

✘ Population
  ○ KAIST CS undergrads to evaluate my menubar
  ○ Selection bias
✘ Ecological
  ○ “In-lab” study of a car dashboard interface
✘ Training
  ○ 1-hour tutorial on the menubar
✘ Task
  ○ Realistic and representative?


Threats to Reliability

✘ Uncontrolled variation
  ○ Previous experience
  ○ User differences
  ○ Task design
  ○ Measurement error


Potential concerns?

✘ Hypothesis: “box A has a different number of balls from box B.”

✘ IV? DV?
✘ Internal validity?
✘ External validity?
✘ Reliability?


Between-Subjects & Within-Subjects

✘ Common user study patterns
✘ Between-subjects
  ○ Users are divided into two groups.
  ○ One IV condition within each group (group A: cond X, group B: cond Y)
  ○ Results are compared between groups.
  ○ Eliminates variation due to ordering effects.
✘ Within-subjects
  ○ Each user sees both conditions (in random order).
  ○ Results are compared within each user.
  ○ Eliminates variation due to user differences.
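
For a within-subjects design, a common alternative to a purely random order is counterbalancing: participants are rotated through a fixed set of orders so that each condition appears in each position equally often. The sketch below uses a simple cyclic Latin square. It is an illustrative technique swapped in for the random ordering mentioned on the slide, and the function names are hypothetical.

```python
def latin_square_orders(conditions):
    """Return one order per row; row i is the list rotated left by i."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]

def order_for(participant_index, conditions):
    """Pick the presentation order for a participant by cycling through rows."""
    orders = latin_square_orders(conditions)
    return orders[participant_index % len(orders)]

# Two conditions, as in the slide: half the users see X first, half see Y first.
for participant_index in range(4):
    print(participant_index, order_for(participant_index, ["X", "Y"]))
```

With more than two conditions, a cyclic square balances position but not which condition immediately precedes which, so carryover effects may still need attention.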


Activity: Analyze a paper’s experiment

bit.ly/i2r-exp1

✘ In pairs, spend 15 mins to analyze the experiment & answer questions.

✘ [Static] vs [Automatic+Dynamic] vs [Customizable+Dynamic] split menus

✘ Skim pages 92-93.


Discussion: Analyze a paper’s experiment

Study setup
- Hypothesis / IV / DV?
- Lab study? Field study? Survey?
- Between-subjects vs Within-subjects? Ordering?
- Users / Implementation / Tasks / Measurement / Hardware

Internal validity
- Ordering effects / Selection effects / Experimenter bias?

External validity
- Population / Ecological / Training / Task?

Reliability
- User differences / Measurement error / Repetition


Online Experimentation


A/B testing Obama campaign

✘ 500 A/B tests over 20 months

✘ Donation conversion increased by 49%

✘ Sign up conversions increased by 161%

http://kylerush.net/blog/optimization-at-the-obama-campaign-ab-testing/
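
The mechanics behind an A/B test can be sketched in a few lines: each visitor is deterministically bucketed into a variant, and conversion rates are then compared as a relative lift. The code below is an illustrative sketch, not the campaign's actual implementation. The experiment name, user IDs, and conversion rates are made up, and the function names are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("A", "B")):
    """Map a user to a variant by hashing the experiment name and user ID.

    The same user always lands in the same variant for a given experiment,
    and different experiments split users independently of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def relative_lift(control_rate, treatment_rate):
    """Relative change of the treatment conversion rate over the control."""
    return (treatment_rate - control_rate) / control_rate

# Hypothetical visitors for a hypothetical experiment.
user_ids = [f"user-{i}" for i in range(1000)]
counts = {"A": 0, "B": 0}
for user_id in user_ids:
    counts[assign_variant(user_id, "donate-button-copy")] += 1
print(counts)

# Made-up rates: 10.0% control vs 14.9% treatment is a 49% relative lift.
print(f"{relative_lift(0.100, 0.149):.0%}")
```

Note that an "increase of 49%" in this sense is relative to the control rate, not a gain of 49 percentage points.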


Can you tell the difference?

Kohavi et al., “Seven Rules of Thumb for Web Site Experimenters,” KDD 2014.

More successful at completing tasks, time-to-success shorter, +$10M annually


Why use online crowds vs lab participants?

✘ Reduced cost
✘ Faster completion
✘ Ease of recruiting
✘ Scalability
✘ Diversity

✘ Lack of control
✘ Lack of expertise
✘ Lack of motivation
✘ Distraction
✘ Quality control

Experimental control vs Ecological validity


[Figure: dictionary definition of “crowdsourcing”, annotated with the labels: outcome, task, scale, undefined crowd, open call]

http://www.merriam-webster.com/dictionary/crowdsourcing


Crowdsourcing Marketplace


Crowdsourcing for Experimentation & Data Collection

✘ Experimentation
  ○ Less WEIRD
  ○ Large scale
  ○ Programmable
✘ Data Collection
  ○ Large scale datasets
  ○ ML, CV, NLP
  ○ Low cost


Facebook Emotion Study

“We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness.”

Kramer et al., PNAS vol. 111 no. 24

✘ A week in 2012
✘ Modified News Feed to show more/less emotional posts
✘ Seeing more negative posts → more negative posts
✘ Seeing more positive posts → more positive posts
✘ Seeing less emotional posts → less emotional posts


Making Online Experiments Work

✘ Instruction comprehension
✘ Environment differences
✘ Qualification task
✘ Outlier removal
✘ Implementation issues
✘ Ethical concerns
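
Two of the items above, the qualification task and outlier removal, lend themselves to a short sketch. The code below assumes a hypothetical response format with an attention-check field and a completion time. The outlier filter uses the median absolute deviation (MAD), one common rule among several rather than a method prescribed by the lecture, and the cutoff `k=3.0` is an assumption.

```python
from statistics import median

def passed_qualification(response, expected_answer="blue"):
    """Hypothetical attention check: the worker typed the requested word."""
    return response.get("attention_check", "").strip().lower() == expected_answer

def remove_outliers(values, k=3.0):
    """Keep values within k scaled MADs of the median.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    when the data are roughly normal.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge against; keep everything
    cutoff = k * 1.4826 * mad
    return [v for v in values if abs(v - center) <= cutoff]

# Made-up crowd responses: completion time in seconds per worker.
responses = [
    {"worker": "w1", "attention_check": "Blue", "seconds": 30},
    {"worker": "w2", "attention_check": "blue", "seconds": 32},
    {"worker": "w3", "attention_check": "blue", "seconds": 35},
    {"worker": "w4", "attention_check": "blue", "seconds": 31},
    {"worker": "w5", "attention_check": "blue", "seconds": 33},
    {"worker": "w6", "attention_check": "blue", "seconds": 34},
    {"worker": "w7", "attention_check": "blue", "seconds": 2},    # rushed through
    {"worker": "w8", "attention_check": "blue", "seconds": 600},  # walked away
    {"worker": "w9", "attention_check": "red", "seconds": 33},    # failed check
]

valid = [r for r in responses if passed_qualification(r)]
times = [r["seconds"] for r in valid]
kept = remove_outliers(times)
print(kept)  # [30, 32, 35, 31, 33, 34]
```

Whatever rule is used, it is good practice to fix the exclusion criteria before looking at the results, so that outlier removal cannot be tuned toward a desired outcome.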


Resources

✘ CS374’s Experiment Design reading
  ○ Much of today’s content is from this material.
  ○ https://www.kixlab.org/courses/cs374-spring-2018/classes/20-Experiment-Design/
