
Quantitative analysis with statistics (and ponies)mmazurek/818D-S16/slides/07-stats.… · Statistics for experimental comparisons • The main idea: Hypothesis testing • Choosing

Quantitative analysis with statistics (and ponies)

(Some slides, pony-based examples from Blase Ur)

•  Interviews, diary studies

•  Start stats

•  Thursday: Ethics/IRB

•  Tuesday: More stats

•  New homework is available

INTERVIEWS

Why an interview

•  Rich data (from fewer people)

•  Good for exploration

–  When you aren’t sure what you’ll find
–  Helps identify themes, gain new perspectives

•  Usually cannot generalize quantitatively

•  Potential for bias (conducting, analyzing)

•  Structured vs. semi-structured

Interview best practices

•  Make participants comfortable

•  Avoid leading questions

•  Support whatever participants say
–  Don’t make them feel incorrect or stupid

•  Know when to ask a follow-up

•  Get a broad range of participants (hard)

Try it!

•  In pairs, write two interview questions about password security/usability

•  Change partners with another pair and ask each other; report back

DIARY STUDIES

Why do a diary study?

•  Rich longitudinal data (from a few participants)
–  In the field … ish

•  Natural reactions and occurrences
–  Existence and quantity of phenomena
–  User reactions in the moment rather than via recall

•  Lots of work for you and your participants

•  On paper vs. technology-mediated

Experience sampling

•  Kind of a prompted diary

•  Send participants a stimulus while they go about their daily lives, not in the lab

Diary / ESM best practices

•  When will an entry be recorded?
–  How often? Over what time period?

•  How long will it take to record an entry?

–  How structured is the response?

•  Pay well

–  Pay per response? But don’t create bias

Facebook regrets (Wang et al.)

•  Online survey, interviews, diary study, 2nd survey

•  What do people regret posting? Why?

•  How do users mitigate?

FB regrets – Interviews

•  Semi-structured, in-person, in-lab

•  Recruiting via Craigslist

–  Why pre-screen questionnaire?
–  19/301

•  Coded by a single author for high-level themes

FB regrets – Diary study

•  “The diary study did not turn out to be very useful”

•  Daily online form (30 days)
–  Facebook activities, incidents
–  “Have you changed anything in your privacy settings? What and why?”
–  “Have you posted something on Facebook and then regretted doing it? Why and what happened?”
–  22+ days of entries: $15
–  12/19 interviewees entered 1+ logs (217 total logs)

Location-sharing (Consolvo et al.)

•  Whether and what about location to disclose
–  To people you know

•  Preliminary interview

–  Buddy list, expected preferences

•  Two-week ESM (simulated location requests)

•  Final interview to reflect on experience

ESM study

•  Whether to disclose or not, and why
–  Customized askers, customized context questions
–  If so, how granular?
–  Where are you and what are you doing?
–  One-time or standing request

•  $60-$250 to maximize participation

•  Average response rate: above 90%

Statistics for experimental comparisons

•  The main idea: Hypothesis testing

•  Choosing the right test: Comparisons

•  Regressions

•  Other stuff

–  Non-independence, directional tests, effect size

•  Tools

OVERVIEW

What’s the big idea, anyway?

Statistics

•  In general: analyzing and interpreting data

•  We often mean: Statistical hypothesis testing
–  Question: Are two things different?
–  Is it unlikely the data would look like this unless there is actually a difference in real life?

Important note

•  This lecture is not going to be precise or complete. It is intended to give you some intuition and help you understand what questions to ask.

The prototypical case

•  Q: Do ponies who drink more caffeine make better passwords?

•  Experiment: Recruit 30 ponies. Give 15 caffeine pills and 15 placebos. They all create passwords.

http://www.fanpop.com/clubs/my-little-pony-friendship-is-magic/images/33207334/title/little-pony-friendship-magic-photo

Hypotheses

•  Null hypothesis: There is no difference

Caffeine does not affect pony password strength.

•  Alternative hypothesis: There is a difference

Caffeine affects pony password strength.

•  Note what is not here (more on this later):
–  Which direction is the effect?
–  How strong is the effect?

Hypotheses, continued

•  Statistical test gives you one of two answers:
1.  Reject the null: We have (strong) evidence the alternative is true.
2.  Don’t reject the null: We don’t have (strong) evidence the alternative is true.

•  Again, note what isn’t here:

–  We have strong evidence the null is true. (NOPE)

P values

•  What is the probability that the data would look like this if there’s no actual difference?

–  i.e., Probability we tell everyone about ponies and caffeine but it isn’t really true

•  Most often, α = 0.05; some people choose 0.01

–  If p < 0.05, reject null hypothesis; there is a “significant” difference between caffeine and placebo

–  A p-value is not magic, just probability, and the threshold is arbitrary

–  But significance is reported as TRUE or FALSE: You don’t say something is “more significant” because the p-value is lower
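
A rough sketch of what a p-value measures, using a permutation test on invented data. The scores, the function name `permutation_p_value`, and the choice of a permutation test are illustrative assumptions, not something from the lecture. The idea: if caffeine made no difference, the group labels would be interchangeable, so we shuffle them and see how often a gap at least as large as the observed one shows up by chance.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean_gap(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # relabel ponies at random (the null world)
        if mean_gap(pooled) >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical password-strength scores (0-100), 15 ponies per group.
caffeine = [62, 71, 58, 80, 66, 75, 69, 73, 64, 78, 70, 67, 72, 61, 77]
placebo = [55, 60, 52, 68, 59, 63, 57, 65, 54, 66, 61, 58, 62, 50, 64]

alpha = 0.05
p = permutation_p_value(caffeine, placebo)
print(f"p = {p:.4f}; reject the null at alpha = {alpha}: {p < alpha}")
```

The decision at the end is the TRUE/FALSE report described above: the test either rejects the null at the chosen α or it does not.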

Type II Error (False negative)

•  There is a difference, but you didn’t find evidence
–  No one will know the power of caffeinated ponies

•  Hypothesis tests DO NOT BOUND this error

•  Instead, statistical power is the probability of rejecting the null hypothesis if you should
–  Requires that you estimate the effect size (hard)
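
One way to build intuition for power is to simulate the experiment many times with an assumed true effect and count how often the test rejects the null. The sketch below assumes normally distributed scores, a two-sample Student's t-test, and 15 ponies per group; the effect sizes and helper names are hypothetical.

```python
import random

N_PER_GROUP = 15    # as in the pony experiment: 15 caffeine, 15 placebo
T_CRITICAL = 2.048  # two-sided critical t value for alpha = 0.05, df = 28

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def rejects_null(a, b):
    """Two-sample Student's t-test (pooled variance) at alpha = 0.05."""
    (mean_a, var_a), (mean_b, var_b) = mean_and_variance(a), mean_and_variance(b)
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t = (mean_a - mean_b) / (pooled * (1 / n_a + 1 / n_b)) ** 0.5
    return abs(t) > T_CRITICAL

def estimate_power(effect_size, n_sims=1000, seed=1):
    """Fraction of simulated experiments that reject the null.

    effect_size is the assumed true difference in means, in standard deviations.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        placebo = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        caffeine = [rng.gauss(effect_size, 1.0) for _ in range(N_PER_GROUP)]
        if rejects_null(caffeine, placebo):
            rejections += 1
    return rejections / n_sims

for d in (0.0, 0.5, 1.0):
    print(f"true effect = {d} SD -> estimated power = {estimate_power(d):.2f}")
```

With a true effect of zero, the rejection rate should sit near α (these are the false positives). The hard part in practice is the one the slide names: you have to guess the effect size before you have data.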

Hypotheses, power, probability

•  After an experiment, one of four things has happened (total P=1).

•  Which box are you in? You don’t know.

PROBABILITY               You rejected the null           You didn’t
Reality: Difference       Estimated via power analysis    ?
Reality: No difference    Bounded by α                    ?

Correlation and causation

•  Correlation: We observe that two things are related

Do rural or urban ponies make stronger passwords?

•  Causation: We randomly assigned participants to groups and gave them different treatments

–  If designed properly

Do password meters help ponies?

CHOOSING THE RIGHT TEST

What kind of data do you have?

•  Explanatory variables: inputs, x-values
–  e.g., conditions, demographics

•  Outcome variables: outputs, y-values

–  e.g., time taken, Likert responses, password strength

What kind of data do you have?

•  Quantitative
–  Discrete (Number of caffeine pills taken by each pony)
–  Continuous (Weight of each pony)

•  Categorical
–  Binary (Is it or isn’t it a pony?)
–  Nominal: No order (Color of the pony)
–  Ordinal: Ordered (Is the pony super cool, cool, a little cool, or uncool)

http://i196.photobucket.com/albums/aa92/karina408_album/Wallpaper-53.jpg

What kind of data do you have?

•  Does your dependent data follow a normal distribution? (You can calculate this!)

–  If so, use parametric tests.
–  If not, use non-parametric tests.

•  Are your data independent?
–  If not, repeated-measures, mixed models, etc.

http://www.wikipedia.org

If both are categorical ….

•  Participants each used one of two systems
–  Did they like the system they got? (Yes/no)

•  HA: System affects user sentiment

•  Use (Pearson’s) χ² (Chi-squared) test of independence.

–  If any single cell has fewer than 5 data points, use Fisher’s Exact Test (also works with lots of data)
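
For a 2x2 table both tests are small enough to write out by hand. This is a sketch with invented counts; the function names are hypothetical, the χ² version applies no continuity correction, and the Fisher version uses the common "sum all tables no more likely than the observed one" definition of a two-sided p-value.

```python
from math import comb, erfc, sqrt

def chi_squared_2x2(table):
    """Pearson's chi-squared test of independence for a 2x2 table (df = 1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    p = erfc(sqrt(stat / 2))  # chi-squared survival function when df = 1
    return stat, p

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_observed = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical counts: rows = system A / system B, columns = liked / didn't.
table = [[20, 10], [10, 20]]
stat, p_chi = chi_squared_2x2(table)
p_fisher = fisher_exact_2x2(table)
print(f"chi-squared = {stat:.3f}, p = {p_chi:.4f}; Fisher p = {p_fisher:.4f}")
```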

Contingency tables

•  Rows one variable, columns the other

•  Example:
–  Row = condition
–  Column = true/false

•  χ² = 97.013, df = 14, p = 1.767e-14

Explanatory: categorical; Outcome: continuous ….

•  Participants each used one system

–  Measure a continuous value (time taken, pwd guess #)

•  HA: System affects password strength

•  Normal, continuous outcome (compare mean):

–  2 conditions: T-test
–  3+ conditions: ANOVA

Explanatory: categorical; Outcome: continuous ….

•  Non-normal outcome, ordinal outcome
–  Does one group tend to have larger values?
–  2 conditions: Mann-Whitney U (AKA Wilcoxon rank-sum)
–  3+ conditions: Kruskal-Wallis
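
To see why rank-based tests suit ordinal or non-normal outcomes, here is a minimal sketch of the Mann-Whitney U statistic. It only uses the order of the values, never their size. The p-value uses the large-sample normal approximation with no tie or continuity correction, which is an assumption made for brevity; a statistics package would handle small samples and ties more carefully. The data and function names are invented.

```python
from math import erfc, sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U for group a, approximate two-sided p-value)."""
    n_a, n_b = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, erfc(abs(z) / sqrt(2))

# Hypothetical password guess numbers (heavily skewed, so not normal).
meter = [1200, 45000, 980000, 3100, 27000, 610000, 88000, 5400]
no_meter = [300, 900, 15000, 120, 4800, 2200, 760, 39000]
u, p = mann_whitney_u(meter, no_meter)
print(f"U = {u}, approximate p = {p:.4f}")
```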

Outcome: Length of password

What about Likert-scale data?

•  Respond to the statement: Ponies are magical.
–  7: Strongly agree
–  6: Agree
–  5: Mildly agree
–  4: Neutral
–  3: Mildly disagree
–  2: Disagree
–  1: Strongly disagree

What about Likert-scale data?

•  Some people treat it as continuous (not good)

•  Other people treat it as ordinal (better!)

–  Difference 1-2 ≠ 2-3
–  Use Mann-Whitney U / Kruskal-Wallis

•  Another good option: binning (simpler)

–  Transform into binary “agree” and “not agree”
–  Use χ² or FET
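
Binning is simple enough to show in a few lines. This sketch assumes the 7-point scale above, where 5 to 7 count as "agree" and everything else (including neutral) counts as "not agree". The responses and the function name are made up.

```python
def bin_likert(responses, agree_threshold=5):
    """Return (agree, not agree) counts for 7-point Likert responses."""
    agree = sum(1 for r in responses if r >= agree_threshold)
    return agree, len(responses) - agree

# Hypothetical responses to "Ponies are magical." in two conditions.
meter = [7, 6, 5, 4, 6, 2, 5, 7, 3, 6]
no_meter = [3, 4, 2, 5, 1, 4, 3, 6, 2, 4]

# Rows = condition, columns = agree / not agree.
table = [list(bin_likert(meter)), list(bin_likert(no_meter))]
print(table)
```

The result is a 2x2 contingency table that can go straight into a χ² test or, when cells are small as they are here, Fisher's Exact Test.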

[Figure: “Password meter annoying” responses by condition. Group labels: Control, Visual, Scoring, Visual & Scoring. Conditions shown: baseline meter, three-segment, green, tiny, huge, no suggestions, text-only, bunny, half-score, one-third-score, nudge-16, nudge-comp8, text-only half-score, bold text-only half-score.]

Notes for study design

•  Plan your analysis before you collect data!
–  What explanatory, outcome variables?
–  Which tests will be appropriate?

•  Ensure that you collect what you need and know what to do with it

–  Otherwise your experiment may be wasted

CONTRASTS

Contrasts

•  If you have more than two conditions,
–  H0 = “the conditions are all the same”
–  HA = “the conditions are not all the same”
–  “Omnibus test”

•  If you fail to reject the null, you are done

•  ONLY if you reject this null, you may compare individual conditions to each other
–  AKA “Pairwise”

Example:

•  Password meters: 15 conditions
–  Does assigned meter affect password strength?
–  Omnibus test: yes
–  Individual meter: Better than no meter?
–  One meter better than another meter?

P values and multiple testing

•  P-values bound Type I error (false positive)
–  You expect this to happen 5% of the time if α = 0.05

•  What happens if you conduct a lot of statistical tests in one experiment?

•  Your cumulative probability of a Type I error can increase dramatically!
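
A quick calculation shows how fast this grows. The sketch assumes the tests are independent and that every null hypothesis is actually true, which is a simplification; the function name is made up.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 14, 20):
    print(f"{n:2d} tests -> P(at least one Type I error) = "
          f"{familywise_error_rate(n):.3f}")
```

With 20 uncorrected tests, the chance of at least one false positive is roughly 64%, not 5%.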

Correcting p-values

•  Goal: Adjust the math so your overall Type I error remains bounded by α = 0.05

•  Many methods for “correcting” p values
–  Bonferroni correction: Easy but conservative (Multiply p values by the number of tests)
–  Holm-Bonferroni is also frequently used
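
Both corrections fit in a few lines. This is a sketch with invented p-values and hypothetical function names. Holm-Bonferroni sorts the p-values, multiplies the smallest by m, the next by m - 1, and so on, then forces the adjusted values to be non-decreasing.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm_bonferroni(p_values):
    """Holm's step-down adjustment; results are in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for k, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - k) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical pairwise-contrast p-values
print("Bonferroni:", [round(p, 3) for p in bonferroni(raw)])
print("Holm:      ", [round(p, 3) for p in holm_bonferroni(raw)])
```

Compare each adjusted p-value to α = 0.05 as usual. Holm never gives a larger adjusted value than Bonferroni, which is why it is less conservative.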

Planned vs. Unplanned Contrasts

•  N-1 free planned contrasts
–  Actually, really planned. No peeking at the data.

•  Additional contrasts (planned or unplanned) require p-correction for multiple testing

Contrasts in the meters paper

“We ran pairwise contrasts comparing each condition to our two control conditions, no meter and baseline meter. In addition, to investigate hypotheses about the ways in which conditions varied, we ran planned contrasts comparing tiny to huge, nudge-16 to nudge-comp8, half-score to one-third-score, text-only to text-only half-score, half-score to text-only half-score, and text-only half-score to bold text-only half-score.”

Continuous/ordinal data

Notes for study design

•  Lots of conditions means lots of correction
–  Which means you need big effect sizes or large N

•  Consider limiting conditions

–  What do you really want to test?
–  Full-factorial or not?

CORRELATION, REGRESSION

Finding a relationship among variables

Correlation

•  Measure two numeric values
–  Are they related?

•  Pearson correlation
–  Requires both variables to be normal
–  Only looks for a linear relationship

•  Often preferred: Spearman’s rank correlation coefficient (Spearman’s ρ)
–  Evaluates a relationship’s monotonicity
–  Both variables get larger together
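
The difference between the two is easy to see on data that rise together but not in a straight line. In this sketch (invented data, hand-rolled helpers), Spearman's ρ is computed as Pearson's r on the ranks.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical: strength doubles with each caffeine pill (monotonic, not linear).
pills = [1, 2, 3, 4, 5, 6]
strength = [2 ** p for p in pills]
print(f"Pearson r = {pearson(pills, strength):.3f}")
print(f"Spearman rho = {spearman(pills, strength):.3f}")
```

Spearman reports a perfect monotonic relationship here, while Pearson is pulled below 1 because the relationship is not linear.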

Regressions

•  What is the relationship among variables?
–  Generally one outcome (dependent variable)
–  Often multiple factors (independent variables)

•  The type of regression you perform depends on the outcome

–  Binary outcome: logistic regression
–  Ordinal outcome: ordinal / ordered regression
–  Continuous outcome: linear regression

Example regression

•  Outcome:
–  Pass pony quiz (or not): Logistic
–  Total score on pony quiz: Linear

•  Independent variables:
–  Age of pony
–  Number of prior races
–  Diet: hay or pop-tarts (code as eatsHay=true/false)
–  (Indicator variables for color categories)
–  Etc.

What you get

•  Linear: Outcome = ax1 + bx2 + c
–  Score = 5*eatsHay - 3*age + 7

•  Logistic: Outcome is in log odds
–  Intuition: probability of passing decreases with age, increases if ate hay, etc.
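
A small sketch of how to read the two kinds of fitted model. The linear coefficients are the ones from the slide. The logistic coefficients are made up for illustration; they only mirror the intuition that passing becomes less likely with age and more likely with hay. In a logistic regression the linear combination is on the log-odds scale, and the logistic function turns it into a probability.

```python
from math import exp

def predict_score(eats_hay, age):
    """Linear model from the slide: Score = 5*eatsHay - 3*age + 7."""
    return 5 * int(eats_hay) - 3 * age + 7

def predict_pass_probability(eats_hay, age,
                             b_hay=1.2, b_age=-0.4, intercept=0.5):
    """Logistic model with hypothetical coefficients (log-odds scale)."""
    log_odds = b_hay * int(eats_hay) + b_age * age + intercept
    return 1 / (1 + exp(-log_odds))  # logistic function: log-odds -> probability

print(predict_score(eats_hay=True, age=2))
print(round(predict_pass_probability(eats_hay=True, age=2), 3))
print(round(predict_pass_probability(eats_hay=False, age=2), 3))
```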

Interactions in a regression

•  Normally, outcome = ax1 + bx2 + c + …

•  Interactions account for situations when two variables are not simply additive. Instead, their interaction impacts the outcome

–  e.g., Maybe blue ponies, and only blue ponies, get a larger benefit from eating pop-tarts before the quiz

•  Outcome = ax1 + bx2 + c + d(x1x2) + …
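
The interaction formula can be made concrete with the blue-pony example. All coefficients below are invented; x1 is "pony is blue" and x2 is "ate pop-tarts".

```python
def quiz_score(is_blue, ate_pop_tarts, a=2.0, b=1.0, c=50.0, d=6.0):
    """Outcome = a*x1 + b*x2 + c + d*(x1*x2), with hypothetical coefficients."""
    x1, x2 = int(is_blue), int(ate_pop_tarts)
    return a * x1 + b * x2 + c + d * (x1 * x2)

# Benefit of pop-tarts = score with pop-tarts minus score without.
benefit_blue = quiz_score(True, True) - quiz_score(True, False)
benefit_other = quiz_score(False, True) - quiz_score(False, False)
print(f"pop-tart benefit, blue ponies:  {benefit_blue}")
print(f"pop-tart benefit, other ponies: {benefit_other}")
```

The benefit for blue ponies is b + d while for everyone else it is just b. The coefficient d is the "extra" that simple addition of the two effects would miss.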

Example regression output

Notes for study design

•  The more input variables in your regression, the more data you will need to collect to get useful results

Try it! In groups of 2-3

•  Does caffeine impact pony password strength?
–  When strength = cracked or not cracked
–  When strength = 0-100 scoring
–  When strength = self-reported perception 1-5
–  Compare caffeine, NyQuil, placebo

•  Do gender, state of residence, and education level impact pony password strength?

OTHER THINGS TO CONSIDER

Non-independence, directional testing, effect size

What if you have lots of questions?

•  If we ask 40 privacy questions on a Likert scale, how do we analyze this survey?

•  One option: Add responses to get “privacy score”
–  Make sure the scales are the same
–  Reverse if needed (e.g., “personal privacy is important to me” and “I don’t care if companies sell my data”)
–  Important: Verify that responses are correlated!
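
Reverse-coding and summing can be sketched as follows. The example assumes every question uses the same 1 to 7 scale; the answers and function names are hypothetical.

```python
def reverse(response, scale_min=1, scale_max=7):
    """Flip a response so that 7 becomes 1, 6 becomes 2, and so on."""
    return scale_max + scale_min - response

def privacy_score(responses, reversed_items):
    """Sum responses after reversing the items listed (by index)."""
    return sum(reverse(r) if i in reversed_items else r
               for i, r in enumerate(responses))

# Item 0: "personal privacy is important to me"
# Item 1: "I don't care if companies sell my data"  (reverse-coded)
# Item 2: another privacy-positive statement
answers = [7, 1, 6]
print(privacy_score(answers, reversed_items={1}))
```

After reversing, a higher number means more privacy concern on every item, so the sum is meaningful. Whether the items actually move together still has to be checked.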

Verifying correlation

•  Usually preferred: Spearman’s rank correlation coefficient (Spearman’s ρ)
–  Evaluates a relationship’s monotonicity
–  e.g., all variables get larger with privacy sensitivity

Another option: Factor analysis

•  Evaluate underlying factors you are detecting

•  You specify N, a number of factors

•  Algorithm groups related questions (N groups)
–  Each group is a factor

•  Factor loadings measure goodness of correlation

–  Questions loading primarily onto one factor are useful

In groups: Plan your analysis

•  Does caffeine impact pony password strength?
–  When strength = cracked or not cracked
–  When strength = 0-100 scoring
–  Compare caffeine, NyQuil, placebo

•  Do gender, age, state of residence, and education level impact pony privacy concern?

–  Concerned vs. unconcerned
–  Privacy “score” by adding 30 questions

Independence

•  Why might your data not be independent?
–  Non-independent sample (bad!)
–  The inherent design of the experiment (ok!)

•  Example: Same ponies make passwords, before and after taking the caffeine pills

–  Each pony cannot be independent of itself

Repeated measures

•  AKA within subjects
–  Measure the same participant multiple times

•  Paired T-test

–  Two samples per participant, two groups

•  Repeated measures ANOVA

–  More general

Hierarchy and mixed model

•  For regressions, use a “mixed model”

•  Intuition: Each pony’s result driven by combo of individual skills, group characteristics, treatment effects

•  Case 1: Many measurements of each pony

•  Case 2: The ponies have some other relationship. e.g., all ponies attended 1 of 5 security camps. (You want to control for this, but not evaluate it.)

Directional testing

•  If your hypothesis goes one way:

Caffeinated ponies make stronger passwords.

•  More power than more general tests
–  BUT, must select direction BEFORE looking at data
–  Won’t reject null if there’s a difference the other way

•  Example: One-tailed T-test

•  Use with caution!

Effect size

•  Hypothesis test: Is there a difference?

•  Also (more?) important: How big a difference?

•  Findings can be “significant” but unimportant

Factor                     Coef.    Exp(coef)   SE       p-value
number of digits           -0.309   0.734        0.011   <0.001
number of lowercase        -0.349   0.705        0.085   <0.001
number of uppercase        -0.391   0.676        0.099   <0.001
number of symbols          -0.632   0.531        0.037   <0.001
digits in middle           -0.130   0.878        0.296    0.660†
digits spread out          -1.569   0.208        0.294   <0.001
digits at beginning         0.419   1.520        0.304    0.168†
uppercase in middle        -0.006   0.994        0.158    0.970†
uppercase spread out        0.540   1.717        0.175    0.002
uppercase at beginning      0.854   2.349        0.160   <0.001
symbols in middle          -0.319   0.727        0.296    0.281†
symbols spread out         -1.403   0.246        0.339   <0.001
symbols at beginning        0.425   1.530        0.296    0.151†
gender (male)               0.007   1.007        0.023    0.773†
birth year                  0.007   1.007        0.001   <0.001
engineering                -0.137   0.872        0.042    0.001
humanities                 -0.071   0.931        0.049    0.144†
public policy               0.032   1.033        0.051    0.530†
science                    -0.170   0.844        0.055    0.002
other                      -0.081   0.922        0.046    0.079†
computer science           -0.193   0.825        0.048   <0.001
business                    0.167   1.182        0.049   <0.001
(# dig.:# lower.)           0.032   1.032        0.004   <0.001
(# lower.:dig. middle)     -0.110   0.896        0.027   <0.001
(# lower.:dig. spread)     -0.237   0.789        0.035   <0.001
(# lower.:dig. begin.)      0.045   1.046        0.036    0.216†
(# lower.:upper. middle)    0.029   1.030        0.073    0.688†
(# lower.:upper. spread)    0.222   1.249        0.076    0.004
(# lower.:upper. begin.)    0.134   1.143        0.074    0.071†
(# lower.:sym. middle)     -0.146   0.864        0.026   <0.001
(# lower.:sym. spread)     -0.164   0.849        0.051    0.001
(# lower.:sym. begin.)      0.019   1.019        0.041    0.638†
(# lower.:birth year)       0.002   1.002       <0.001   <0.001
(# upper.:upper. middle)   -0.310   0.733        0.111    0.005
(# upper.:upper. spread)   -0.613   0.542        0.134   <0.001
(# upper.:upper. begin.)   -0.528   0.590        0.106   <0.001
(dig. middle:sym. middle)  -1.042   0.353        0.300   <0.001
(dig. spread:sym. middle)  -0.137   0.872        0.293    0.640†
(dig. begin.:sym. middle)  -0.314   0.730        0.307    0.306†
(dig. middle:sym. spread)   0.207   1.230        0.341    0.545†
(dig. spread:sym. spread)   0.225   1.253        0.379    0.552†
(dig. begin.:sym. spread)  -0.602   0.548        0.559    0.282†
(dig. middle:sym. begin.)  -0.604   0.547        0.306    0.048

Table 3: Final Cox regression results for all participants, including composition factors, with interactions. Interaction effects, shown in parentheses, indicate that combination of two factors is associated with stronger (negative coefficient) or weaker (positive coefficient) passwords than would be expected simply from adding the individual effects of the two factors.

58% as likely to be guessed. Each additional login during the measurement period is associated with an estimated increase in the likelihood of guessing of 0.026%. Though this effect is statistically significant, we consider the effect size to be negligible. No significant interactions between factors were found in the final model.

Notable behavioral factors that do not appear in the final regression include median time between login events, wired login rate (as opposed to wireless), and non-web authentication rate (e.g., using an email client to retrieve email without using the web interface).

4.2.4 Model 4: Survey participants

Among survey participants, we find correlations between password strength and responses to questions about compliance strategies and user sentiment during creation. As before, college also appears in the final model.

Factor                     Coef.    Exp(coef)   SE       p-value
login count                <0.001   1.000       <0.001   <0.001
password fail rate         -0.543   0.581        0.116   <0.001
gender (male)               0.078   0.925        0.027    0.005
engineering                -0.273   0.761        0.048   <0.001
humanities                 -0.107   0.898        0.054    0.048
public policy               0.079   1.082        0.058    0.176†
science                    -0.325   0.722        0.062   <0.001
other                      -0.103   0.902        0.053    0.051†
computer science           -0.459   0.632        0.055   <0.001
business                    0.185   1.203        0.054   <0.001

Table 4: Final Cox regression results for personnel with consistent passwords, using a model with no interactions. For an explanation, see Table 1.

Factor                Coef.    Exp(coef)  SE      p-value
annoying              0.375    1.455      0.116   0.001
substituted numbers   -0.624   0.536      0.198   0.002
gender (male)         -0.199   0.820      0.120   0.098†
engineering           0.523    1.693      0.342   0.124†
humanities            0.435    1.545      0.367   0.235†
public policy         1.000    2.719      0.394   0.011
science               0.432    1.541      0.416   0.299†
other                 0.654    1.922      0.334   0.051†
computer science      0.681    1.976      0.351   0.052†
business              1.039    2.826      0.376   0.006

Table 5: Final Cox regression results for survey participants. For an explanation, see Table 1.

Perhaps unsurprisingly, users who report that complying with the university’s password policy was annoying have weaker passwords, 46% more likely to be guessed than those who do not report annoyance. This suggests that password policies that annoy users may be counterproductive. Users who substitute numbers for some of the letters in a word or name, by contrast, make passwords only 54% as likely to be guessed. We do not know whether or not these are typical “l33t” substitutions. Figures 4-5 illustrate these findings and full details appear in Table 5. For this subpopulation, there are not enough data points for a model with interaction to be valid.

Factors that do not appear in the final model include responses that complying with the password policy was difficult or fun; about twice as many users (302) agreed that it was annoying as agreed that it was difficult (162), and only 74 users found it fun. In addition, self-reported storage and the reason why the password was changed are not significant factors.

5. COMPARING REAL AND SIMULATED PASSWORD SETS

Acquiring high-quality password data for research is difficult, and may come with significant limitations on analyses. As a result, it is important to understand to what extent passwords collected in other settings (e.g., from data breaches or online studies) resemble high-value passwords in the wild. In this section, we examine in detail similarities and differences between the various password sets to which we have access. We first compare guessability, then examine other properties related to password composition. Overall, across several measures, passwords from online studies are consistently similar to the real, high-value CMU passwords. In contrast, passwords leaked from other sources prove to be close matches in some cases and by some metrics but highly dissimilar in others.


68

TOOLS


69

So how do I DO these tests?

•  Excel: Very easy, but not very powerful

–  Doesn’t have many useful tests

•  R: Most powerful, steepest learning curve

–  Like Matlab but for stats
–  Somewhat bizarre language/API/data representation
–  Free and open-source (awesome add-on packages)

•  SPSS: Graphical, also quite powerful

–  Expensive ($25 student license from Terpware)
–  Somewhat scriptable, not as flexible as R
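For a sense of what these tools compute under the hood, here is a minimal sketch of one common test, a Pearson chi-squared test of independence on a 2x2 table, written in plain Python (not one of the tools listed above). The counts are hypothetical and made up for illustration, the function name `chi_squared_2x2` is an assumption, and no continuity correction is applied. In practice a stats package would do this in one call.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). With 1 degree of freedom the p-value
    has the closed form erfc(sqrt(statistic / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are two conditions, columns are (success, failure).
counts = [[30, 20], [18, 32]]
stat, p_value = chi_squared_2x2(counts)
print(f"chi-squared = {stat:.3f}, p = {p_value:.4f}")
```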


70

R tutorials

•  http://www.statmethods.net

•  http://cyclismo.org/tutorial/R/


71

Choosing a test

•  http://webspace.ship.edu/pgmarr/Geo441/Statistical%20Test%20Flow%20Chart.pdf

•  http://abacus.bates.edu/~ganderso/biology/resources/statistics.html

•  http://bama.ua.edu/~jleeper/627/choosestat.html

•  http://med.cmb.ac.lk/SMJ/VOLUME%203%20DOWNLOADS/Page%2033-37%20-%20Choosing%20the%20correct%20statistical%20test%20made%20easy.pdf

•  http://fwncwww14.wks.gorlaeus.net/images/home/news/Flowchart2011.jpg