
Statistical methods for Data Science, Lecture 5: Interval estimates; comparing systems

Richard Johansson

November 18, 2018


statistical inference: overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?

“recipes”

- in this lecture, we’ll look at a few “recipes” that you’ll use in the assignment:
  - interval estimate for a proportion (“heads probability”)
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we’ll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of experiment? what assumptions about the data?

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] so that the true value of the parameter is “likely” to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]

frequentists and Bayesians again...

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%

interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method

the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution
- this distribution depends on the sample size
  - large sample → more concentrated distribution

[figure: distribution of the estimate for a sample of size n = 25]

estimator distribution and sample size (p = 0.35)

[figure: estimator distributions for n = 10, 25, 50, 100; the distribution becomes more concentrated as n grows]
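A minimal simulation sketch (not part of the slides) that reproduces this effect: for each sample size we draw many datasets with true p = 0.35 and look at how spread out the ML estimates k/n are.

import numpy as np

rng = np.random.default_rng(0)
p_true = 0.35

# for each sample size, simulate many datasets and compute the ML estimate k/n
for n in [10, 25, 50, 100]:
    k = rng.binomial(n, p_true, size=100000)   # number of successes in each simulated dataset
    estimates = k / n
    print(f"n = {n:3d}: std of the estimates = {estimates.std():.3f}")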


confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal
- instead, we’ll consider a method to compute a Bayesian credible interval that does not use any approximations
  - works fine even if the numbers are small

credible intervals in Bayesian statistics

1. choose a prior distribution
2. compute a posterior distribution from the prior and the data
3. select an interval that covers e.g. 95% of the posterior distribution

[figures: prior density, posterior density, and the selected 95% interval]

recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)
- then the posterior is also a Beta, with parameters k + a and (n − k) + b
- select a 95% interval

[figures: Beta prior, Beta posterior, and the 95% interval of the posterior]

in Scipy

- assume n_success successes out of n
- recall that we use ppf to get the percentiles!
- or even simpler, use interval

from scipy import stats

a = 1
b = a
n_fail = n - n_success
posterior_distr = stats.beta(n_success + a, n_fail + b)

p_low, p_high = posterior_distr.interval(0.95)

example: political polling

- we ask 87 randomly selected Gothenburgers about whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

n_for = 81
n = 87
n_against = n - n_for

p_mle = n_for / n

posterior_distr = stats.beta(n_for + 1, n_against + 1)

print('ML / MAP estimate:', p_mle)
print('95% credible interval:', posterior_distr.interval(0.95))

don’t forget your common sense

- I ask 14 Applied Data Science students about whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes
- will I get a good estimate?

recipe 2: mean of a normal

- we have some sample that we assume follows some normal distribution; we don’t know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
  - this is a bell-shaped distribution with longer tails than the normal
- the t distribution has a parameter called degrees of freedom (df) that controls the tails

[figure: the bell-shaped density of a t distribution]


recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

- to get an interval estimate, select a 95% interval in this distribution

[figures: the t distribution used for the mean, and its 95% interval]

example

- to demonstrate, we generate some data:

import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

mu_mle = x.mean()
s = x.std()
n = len(x)

posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))

print('estimate:', mu_mle)
print('95% credible interval:', posterior_distr.interval(0.95))

alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of mean, variance, and standard deviation: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html
- 95% credible intervals for the mean and the std:

res_mean, _, res_std = stats.bayes_mvs(x, 0.95)

mu_est, (mu_low, mu_high) = res_mean
sigma_est, (sigma_low, sigma_high) = res_std

recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation

the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can’t be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset


bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up (a Python sketch follows below)

repeat N times:
    D* = pick k items randomly (with replacement) from D
    m = estimate computed on D*
    store m in a list M
return the 2.5% and 97.5% percentiles of M

[figure: histogram of the bootstrap estimates]

- see Wikipedia for different varieties
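A minimal sketch of this procedure (an assumed implementation, not from the slides), using the mean of per-item 0/1 correctness values as the estimate:

import numpy as np

def bootstrap_interval(data, estimator, n_resamples=10000, level=0.95, seed=0):
    """Percentile bootstrap interval for an arbitrary estimator."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    k = len(data)
    estimates = []
    for _ in range(n_resamples):
        # pick k items randomly, with replacement, from the original dataset
        resample = data[rng.integers(0, k, size=k)]
        estimates.append(estimator(resample))
    alpha = (1 - level) / 2
    return np.percentile(estimates, [100 * alpha, 100 * (1 - alpha)])

# example: interval estimate for an accuracy, given per-document correctness indicators
correct = np.array([1] * 40 + [0] * 10)
print(bootstrap_interval(correct, np.mean))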


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we’ll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the “tests” (e.g. documents) are independent
- for tests in general, see e.g. Wikipedia

comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
  - so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?

null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
  - the null hypothesis is typically “boring”: the true accuracy is equal to 0.7
- the “unusualness” is measured by the p-value
  - if the null hypothesis is true, how likely are we to see an outcome as unusual as the one we got?
- the traditional threshold for p-values to be considered “significant” is 0.05

the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50
  - is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40

historical side note: sex ratio at birth

- the first known case where a p-value was computed involved the investigation of sex ratios at birth in London in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)

“From whence it follows, that it is Art, not Chance, that governs.”
(Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710)


example

- 40 correct guesses out of 50
- if the true accuracy is 0.70, is 40 out of 50 an unusual result?

[figure: probabilities of each outcome under the null hypothesis, Binomial(50, 0.70)]

- the p-value is 0.16, which isn’t “significantly” unusual!

implementing the exact binomial test in Scipy

- assume we made x correct guesses out of n
- is the accuracy significantly different from test_acc?
- the p-value is the sum of the probabilities of the outcomes that are at least as “unusual” as x:

import scipy.stats

def exact_binom_test(x, n, test_acc):
    rv = scipy.stats.binom(n, test_acc)
    p_x = rv.pmf(x)
    p_value = 0
    for i in range(0, n+1):
        p_i = rv.pmf(i)
        if p_i <= p_x:
            p_value += p_i
    return p_value

- actually, we don’t have to implement it since there is a function scipy.stats.binom_test that does exactly this!
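For the running example (40 correct out of 50, compared to 0.70), the library call would look as follows; note that in recent SciPy versions binom_test has been replaced by binomtest, which returns an object with a pvalue attribute.

import scipy.stats

# p-value for 40 correct out of 50 under the null hypothesis that the accuracy is 0.70
p_value = scipy.stats.binom_test(40, 50, 0.70)
print(p_value)   # roughly 0.16, as in the example above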


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


comparing two classifiers

- I’m comparing a Naive Bayes and a perceptron classifier
- we evaluate them on the same test set
- the NB classifier had 186 correct out of 312 guesses
- ... and the perceptron had 164 correct guesses
- so the ML estimates of the accuracies are 0.60 and 0.53, respectively
- but does this strongly support that the NB classifier is really better?

contingency table

- we make a table that compares the errors of the two classifiers (a sketch of how to obtain these counts from raw predictions follows below):

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- if NB is about as good as the perceptron, the B and C values should be similar
  - conversely, if they are really different, B and C should differ
- are these B and C values unusual?
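A minimal sketch (not from the slides) of how such counts could be computed, assuming we have the gold labels and the two classifiers’ predictions as arrays:

import numpy as np

def contingency_counts(y_true, pred_1, pred_2):
    """Counts of agreement/disagreement in correctness between two classifiers."""
    correct_1 = np.asarray(pred_1) == np.asarray(y_true)
    correct_2 = np.asarray(pred_2) == np.asarray(y_true)
    A = np.sum(correct_1 & correct_2)      # both classifiers correct
    B = np.sum(correct_1 & ~correct_2)     # classifier 1 correct, classifier 2 wrong
    C = np.sum(~correct_1 & correct_2)     # classifier 1 wrong, classifier 2 correct
    D = np.sum(~correct_1 & ~correct_2)    # both classifiers wrong
    return A, B, C, D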


McNemar’s test

- in McNemar’s test, we model the discrepancies (the B and C values)

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- there are a number of variants of this test
- the original formulation:
  Quinn McNemar (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153–157.
- our version builds on the exact binomial test that we saw before

McNemar’s test (continued)

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- the number of discrepancies is B + C
- how are the discrepancies distributed?
  - if the two systems are equivalent, the discrepancies should be more or less evenly spread into the B and C boxes
  - it can be shown that B would be a binomial random variable with parameters B + C and 0.5
- so we can find the p-value (the “unusualness”) like this:

p_value = scipy.stats.binom_test(B, B + C, 0.5)

- in this case it is 0.035, supporting the claim that NB is better
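With the counts from the table filled in (a minimal sketch; as before, newer SciPy versions use scipy.stats.binomtest instead):

import scipy.stats

B, C = 39, 61                                    # discrepancy counts from the contingency table
p_value = scipy.stats.binom_test(B, B + C, 0.5)
print(p_value)                                   # roughly 0.035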


alternative implementation

http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html


overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing


searching for significant effects

- scientific investigations sometimes operate according to the following procedure:
  1. propose some hypothesis
  2. collect some data
  3. do we get a “significant” p-value over some null hypothesis?
  4. if no, revise the hypothesis and go back to 3
  5. if yes, publish your findings, promote them in the media, ...

searching for significant effects (alternative)

- or a “data science” experiment:
  1. you are given some dataset and told to “extract some meaning” from it
  2. look at the data until you find a “significant” effect
  3. publish ...

searching for significant effects

- remember: if the null hypothesis is true, we will still see “significant” effects about 5% of the time
- consequence: if we search long enough, we will probably find some effect with a p-value that is small
  - even if this is just due to chance

spurious correlations

[figure from tylervigen.com: “Letters in winning word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders”, 1999–2009]

“data dredging”: further reading

https://en.wikipedia.org/wiki/Data_dredging

https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data

some solutions

- common sense
- held-out data (a separate test set)
- correcting for multiple comparisons

Bonferroni correction for multiple comparisons

- assume we have an experiment where we carry out N comparisons
- in the Bonferroni correction, we multiply the p-values of the individual tests by N (or alternatively, divide the “significance” threshold by N)

Bonferroni correction for multiple comparisons: example
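A minimal illustrative sketch with hypothetical p-values (the numbers are assumptions, not from the lecture): with N = 10 comparisons, each p-value is multiplied by 10, or equivalently the raw threshold 0.05 becomes 0.005.

import numpy as np

# hypothetical p-values from N = 10 comparisons
p_values = np.array([0.003, 0.02, 0.04, 0.30, 0.07, 0.54, 0.01, 0.80, 0.12, 0.45])
N = len(p_values)

# Bonferroni: multiply each p-value by N (capped at 1),
# or equivalently compare the raw p-values to 0.05 / N
corrected = np.minimum(p_values * N, 1.0)
print("significant after correction:", corrected < 0.05)   # only the 0.003 test survives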


the rest of the week

- Wednesday: Naive Bayes and evaluation assignment
- Thursday: probabilistic clustering (Morteza)
- Friday: QA hours (14–16)
