MAS472/6004: Computational Inference

Chapter II: Simulation methods for classical statistics

Introduction

Classical statistical theory contains many methods for testing hypotheses in numerous different situations.

Derivation of these tests can be difficult or impossible in some cases and often relies on asymptotic results or approximations.

If the test we wish to perform is non-standard then deriving a suitable test procedure may not be possible (or we may have forgotten the correct test!).

In this chapter we consider what can be done using simulation methods.


2.1 Monte Carlo tests

Recap of hypothesis testing framework

Suppose that we have a null hypothesis H0 represented by a completely specified model and that we wish to test this hypothesis using data X1, . . . , Xn. We proceed as follows.

1. Assume H0 is true.

2. Find a test statistic T(X1, . . . , Xn) for which large values indicate departure from H0.

3. Calculate the theoretical sampling distribution of T under H0.

4. The observed value Tobs = T(x1, . . . , xn) of the test statistic is compared with the distribution of T under H0. Either

• (Neyman-Pearson) reject H0 if Tobs > c. Here c is chosen so that P(T ≥ c|H0) = α, where α is the size of the test, i.e., P(reject H0|H0 true) = α.

• (Fisher) compute the p-value p = P(T ≥ Tobs|H0) and report it. This represents the strength of evidence against H0.


Example 1: normal parametric test

Suppose X1, . . . , Xn are iid N(µ, σ²), and suppose that σ² = 1 is known. Consider the null hypothesis

\[ H_0 : \mu = 0. \]

Testing proceeds as follows:

1. Assume µ = 0.

2. The Neyman-Pearson lemma shows that a suitable test statistic is

\[ T(X) = \frac{1}{n} \sum X_i. \]

3. Under H0 we can show \( T \sim N(0, \sigma^2/n) \), so that \( Z = n^{1/2} T/\sigma \sim N(0, 1) \).

4. If Zobs = 2 then we find the p-value P(Z ≥ Zobs|H0) = 1 − Φ(2) = 0.023, and so we reject H0 at the 5% level.

What test do we perform if σ² is unknown? What if the Xi have t-distributions rather than normal distributions? What if the Xi are not iid?


It may not be possible to derive the sampling distribution of T under H0, for example if

• T is not some fairly simple function,

• or if X1, . . . , Xn are not independent samples from the population of interest (dependent data are common in real problems).

Moreover, in deriving the distribution of T, we typically assume that n is large, that variances are equal, normality, etc. If these assumptions don't hold then our distribution for T will be incorrect.


Monte Carlo Tests

We may not know the distribution of T under H0, but often it is possible to simulate from the model to produce sample data sets

\[ \{X_1^{(i)}, \ldots, X_n^{(i)}\} \quad \text{for } i = 1, \ldots, m-1. \]

From these we can calculate m − 1 sample values of the statistic under H0,

\[ \{T_1, \ldots, T_{m-1}\}. \]

We can then estimate the distribution of T under H0 from this sample, and can estimate the critical value c or the p-value by a Monte Carlo approximation, i.e., estimate P(T > Tobs|H0) by

\[ \frac{1}{m-1} \sum_{i=1}^{m-1} I_{T_i > T_{obs}}. \]


Monte Carlo Testing Algorithm

1. Generate m − 1 sample test statistics T1, . . . , Tm−1 according to H0.

2. For a test of size α, define k = mα. If Tobs is one of the k largest values in {T1, . . . , Tm−1, Tobs} then reject H0; i.e., reject H0 if

\[ T_{obs} > T_{(m-k)}, \]

where T(1), . . . , T(m) are the order statistics of T1, . . . , Tm−1, Tobs.


Example: normal parametric test revisited

In this simple case we know the distribution of T under H0, but it is informative to consider the Monte Carlo test.

1. Generate 999 samples of size n from a N(0, σ²) distribution and calculate T for each.

t.sample <- c()
for(i in 1:999){
  temp <- rnorm(n=n, mean=0, sd=sigma)
  t.sample[i] <- mean(temp)
}
z.sample <- t.sample*sqrt(n)/sigma
# add observation Zobs = 2 to simulated data
z <- c(z.sample, 2)


For α = 0.05, we find the 95th percentile of the sampling distribution:

c <- quantile(z, 0.95)

Then we compare c with the observed value of 2. I found c = 1.67, so we would reject H0 at the 5% level.

If instead we wanted to estimate the p-value P(T ≥ Tobs|H0), we could estimate it using the R command

sum(z>2)/1000

For my implementation I found a p-value of 0.028, which again suggests we should reject H0 at the 5% level.

Note that this is a random test: if we repeat it multiple times we will get a slightly different answer each time.


Example 2: Chi-squared tests

Exam grades are to be compared between 16 boys and 19 girls in a single class. The data are:

        A  B  C  D
boys    3  4  5  4
girls   8  8  3  0

The null hypothesis is that there is no difference between boys and girls in exam performance.

In other words, a girl and a boy chosen at random have the same probability of obtaining any particular grade.


To apply the standard chi-squared test in this case we would calculate the table of expected values under H0 and then calculate the test statistic

\[ T = \sum \frac{(O_i - E_i)^2}{E_i}. \qquad (1) \]

Calculating this for the data we find Tobs = 7.907, which has a p-value of 0.048.

However, as a rule of thumb, to use the χ² test the expected number of counts in each cell should be at least 5. In this case, 4 of the 8 values are less than 5, which means the assumptions used to show that T has a χ² distribution with 3 degrees of freedom do not hold.


Consider using a Monte Carlo test to perform the test.

1. Under H0, the probabilities of obtaining each grade are given by the estimates

             A      B      C     D
probability  11/35  12/35  8/35  4/35

2. We then generate a new set of results for boys and girls; the boys' results are sampled from a Multinomial(16; 11/35, 12/35, 8/35, 4/35) distribution, and the girls' from a Multinomial(19; 11/35, 12/35, 8/35, 4/35). An example is shown below:

        A  B  C  D
boys    3  5  6  2
girls   4  5  7  3

3. Calculate T for these data, which for this simulated dataset is 5.323.

We then repeat m − 1 times to get T1, . . . , Tm−1, as in the sketch below.
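A minimal R sketch of this procedure, assuming the grade data above; the helper chisq.stat, the choice m = 100, and the use of expected counts computed from the estimated probabilities are illustrative choices:

boys  <- c(3, 4, 5, 4)
girls <- c(8, 8, 3, 0)
probs <- (boys + girls)/35       # estimated grade probabilities under H0
E <- c(16*probs, 19*probs)       # expected counts under H0

chisq.stat <- function(b, g) sum((c(b, g) - E)^2/E)
t.obs <- chisq.stat(boys, girls) # 7.907

m <- 100
t.sim <- replicate(m - 1, {
  b <- rmultinom(1, size = 16, prob = probs)  # simulated boys' grades
  g <- rmultinom(1, size = 19, prob = probs)  # simulated girls' grades
  chisq.stat(b, g)
})
(sum(t.sim >= t.obs) + 1)/m      # Monte Carlo p-value (see the p-values section below)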


We rank T1, . . . , Tm−1 together with the observed value of the test statistic Tobs. In R we generate 99 test statistics and find 75 to be less than Tobs and 24 to be greater. In this case the null hypothesis is not rejected at the 5% level.

Notice that this is a different conclusion to that reached using the χ² test.

In this case, the Monte Carlo test should be preferred, as we are working with the true distribution of the test statistic and not an approximation.

In general, when conducting hypothesis tests we do not have to be so reliant on distributional approximations, and we should always consider the option of working with exact distributions.


Example 3: Testing for randomness in spatial patterns

H0: The spatial locations of each subject are randomly distributed over the unit square: both coordinates have U[0, 1] distributions.

There are various possibilities for the test statistic; we will consider the nearest-neighbour distance of each subject.

Let di denote the distance from subject i to the next closest subject, and define the test statistic T to be

\[ T = \left( \sum_{i=1}^{50} d_i \right)^{-1}. \qquad (2) \]

If the locations are clustered, nearest-neighbour distances will be small ⇒ T will be large.


[Figure: locations of the 50 subjects in the unit square]

Under H0 we don't know the theoretical sampling distribution of T. It is straightforward to simulate values of T under H0, however, so we can estimate the critical values (such as the 95th percentile) we need for the hypothesis test.

1. Generate locations (x, y) of each subject by sampling x and y independently from U[0, 1].

2. For each subject, find the closest observation and measure the distance to it to obtain the nearest-neighbour distance for that observation.

3. Take the reciprocal of the sum of the 50 nearest-neighbour distances to get Ti.

Given a sample T1, . . . , Tm−1, we then rank T1, . . . , Tm−1 and the observed Tobs in order to give T(1), . . . , T(m). For a test of size 5%, if Tobs is one of the 0.05 × m largest values, then H0 is rejected.
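A sketch of this test in R; here x.obs stands in for the observed 50 × 2 matrix of locations, and m = 1000 is an illustrative choice:

nn.stat <- function(x) {
  d <- as.matrix(dist(x))    # pairwise distances between subjects
  diag(d) <- Inf             # exclude each subject's distance to itself
  1/sum(apply(d, 1, min))    # T = (sum of nearest-neighbour distances)^(-1)
}

m <- 1000
t.sim <- replicate(m - 1, nn.stat(cbind(runif(50), runif(50))))
t.obs <- nn.stat(x.obs)      # x.obs: the observed locations (assumed available)
(sum(t.sim >= t.obs) + 1)/m  # estimated p-value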


p-values

We can estimate the p-value P(T ≥ Tobs|H0) of a Monte Carlo test by looking at the number of simulated values greater than or equal to Tobs:

\[ \hat{p} = \frac{1}{m} \left( \sum_{i=1}^{m-1} I_{T_i \geq T_{obs}} + 1 \right). \]

Exercise: If p = P(T ≥ Tobs|H0), show that p̂ ≥ 1/m and

\[ \sum_{i=1}^{m-1} I_{T_i \geq T_{obs}} \sim \mathrm{Bin}(m-1, p), \]

so that the estimate p̂ has expectation

\[ E(\hat{p}) = p + \frac{1-p}{m} \]

and is therefore a biased estimator of p. Note that for large m the bias is small.


How large should m be?

We need to choose m sufficiently large so that the random sample T1, . . . , Tm−1 allows us to estimate the critical region to a sufficient degree of accuracy.

The Monte Carlo test has a random critical point and so 'blurs' the critical region.

• We reject H0 if at most k − 1 of the values {T1, . . . , Tm−1} exceed Tobs, where k = mα.

• If p is the true p-value then we reject H0 with probability

\[ R(p) = \sum_{r=0}^{k-1} \binom{m-1}{r} p^r (1-p)^{m-r-1} = P(\mathrm{Bin}(m-1, p) \leq k-1). \]
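Since R(p) is a binomial tail probability, curves like those in the figure below are easy to compute; a sketch, with the values of m illustrative:

R.p <- function(p, m, alpha = 0.05) {
  k <- alpha*m                           # k = m*alpha as above
  pbinom(k - 1, size = m - 1, prob = p)  # P(Bin(m-1, p) <= k-1)
}
p <- seq(0, 0.12, length.out = 200)
plot(p, R.p(p, m = 100), type = "l", ylab = "Rejection probability")
lines(p, R.p(p, m = 1000), lty = 2)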


[Figure: rejection probability R(p) plotted against the true p-value p, for m = 100, 200, 1000, and 10000]

R(p) can be interpreted as the proportion of times the Monte Carlo test will reject H0 when we observe Tobs.

For p-values smaller than 0.05 we want R(p) to be large, and for p-values greater than 0.05 we want R(p) to be small. We choose m to make this so.

We conclude from the figure that a sample size of m = 100 is usually acceptable, as long as the results aren't interpreted too rigidly.

Of course, this is only an issue if generating test statistics requires substantial computational effort. If it is trivial to generate sample test statistics (which it is in all but the most complex of cases), then a much larger value of m can be used.


2.2 Randomisation Tests

Monte Carlo tests allowed us to do hypothesis tests when the null hypothesis specified a complete distribution for the data, e.g., H0 : Xi ∼ N(0, 1).

We now consider a second technique, known as randomisation tests, for deriving the sampling distribution of the test statistic, where no distributional assumptions about the data are required.

The general scenario under consideration is that of an investigation into whether or not a particular treatment/covariate/factor has an effect on some response.

Our aim is to test this without fully specifying a distribution for the data.


Example 1: Cholesterol data

A small study was conducted to investigate the effect of diet on cholesterol levels. Volunteers were randomly allocated to one of two diets, and cholesterol levels were recorded at the end of the trial period.

Diet A  233 291 312 250 246 197 268 224
Diet B  185 263 246 224 212 188 250 148

The interest is in whether or not there is a significant difference between the mean cholesterol levels for the two groups. The null hypothesis is

H0 : mean cholesterol levels with the two diets are equal.


A standard classical analysis of these data might be to assume

\[ X_i^{(j)} \sim N(\mu_j, \sigma^2) \]

for i = 1, . . . , 8 and j = 1, 2, with σ² an unknown common variance.

The standard test is then a two-sample t-test, based on the statistic

\[ T = \frac{\bar{X}^{(1)} - \bar{X}^{(2)}}{\sqrt{s^2/8 + s^2/8}}, \qquad (3) \]

where s² is the pooled estimate of variance.

Then under H0 (and assuming normality of the data!), the test statistic T has a t-distribution with 14 degrees of freedom.

For these data, the observed test statistic Tobs is 2.0034, with a p-value of 0.0649 for a two-sided test.


But what if we want to analyse the data without assuming normality, e.g., because the sample sizes are small?

Randomisation tests can be used to find a distribution for T without making any distributional assumptions about the data.

If H0 is true, then any difference in the two sample means would be solely due to how the 16 individuals were assigned to the two groups. So if H0 is true, what is the probability of observing a sizeable difference between the two group means?

It must be equal to the probability of assigning the individuals to the two groups in such a way that the imbalance occurs, as long as the individuals were assigned to the two groups at random in the actual study. This is the principal idea behind randomisation tests.


Randomisation Test

1. Suppose the 16 individuals in the study have been labelled

Diet A  1  2  3  4  5  6  7  8
Diet B  9 10 11 12 13 14 15 16

2. Randomly re-assign the 16 individuals to the two groups.

3. Re-calculate the test statistic for the permuted data.

4. Repeat 2 and 3 to obtain B sampled test statistics, denoted T1, . . . , TB.

5. For a two-sided test, the estimated p-value of the observed test statistic Tobs is

\[ \frac{1}{B} \sum_{i=1}^{B} I_{|T_i| \geq |T_{obs}|}. \]

Using 10000 random permutations gave a p-value of 0.063.
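A sketch of this test in R, using the difference in sample means (shown below to be an equivalent test statistic):

dietA <- c(233, 291, 312, 250, 246, 197, 268, 224)
dietB <- c(185, 263, 246, 224, 212, 188, 250, 148)
chol  <- c(dietA, dietB)

t.obs <- mean(dietA) - mean(dietB)
B <- 10000
t.perm <- replicate(B, {
  idx <- sample(16, 8)                  # random re-assignment to diet A
  mean(chol[idx]) - mean(chol[-idx])
})
mean(abs(t.perm) >= abs(t.obs))         # two-sided p-value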


Equivalent test statistics

The significance level of Tobs is determined using

\[ \hat{p} = \frac{1}{B} \sum_{i=1}^{B} I\{|T_i| \geq |T_{obs}|\}. \]

Notice that multiplying Tobs and all the Ti by some constant would have no effect on the significance level; the ordering would be preserved. An equivalent test statistic is one that preserves ordering and hence does not change the p-value. In the example, an equivalent test statistic would be

\[ T = \bar{X}^{(1)} - \bar{X}^{(2)}, \qquad (4) \]

i.e., there is no need to compute the denominator in Equation (3).


Exact randomisation tests

We could consider systematically every possible permutation, rather than a random sample of permutations, to determine the significance level.

• This is known as an exact randomisation test or permutation test.

• It can be computationally demanding/impracticable if the number of possible permutations is large.

• A large sample of random permutations should be sufficient.


Outliers

In parametric tests, outlying observations in the data can cause problems.

• In the comparison of means problem, an outlier can increase the difference \( \bar{X}^{(1)} - \bar{X}^{(2)} \) and will inflate the within-group variance.

• Consequently the true significance of the test statistic may be underestimated.

In a randomisation test, you are comparing the relative size of the observed test statistic to its value under alternative random permutations.

Hence, an outlier will not have the same effect.


Example 2

This is illustrated with some data from a study reported in Ezinga (1976) for two treatments A and B:

A  0.33 0.27 0.44 0.28 0.45 0.55 0.44 0.76 0.59 0.01
B  0.28 0.80 3.72 1.16 1.00 0.63 1.14 0.33 0.26 0.63

The sample group means are \( \bar{X}_A = 0.412 \) and \( \bar{X}_B = 0.995 \), and the observed test statistic for a two-sample t-test is T = 1.78.

For a two-tailed test this gives a p-value of 0.11, so it is not significant at the 5% level. Using a randomisation test, T is now significant at the 5% level, with a p-value of about 0.03.

Exercise: Check this conclusion in R.


Example 3: Analysis of Variance

Randomisation tests are applicable in many different contexts; analysis of variance is another example. Below are responses measured on four treatment groups:

Group A  -0.10 -1.10  0.74 -3.80
Group B   0.94 -0.30  0.67  0.86  1.19
Group C  -0.25  0.84  0.04  0.25
Group D   0.99  0.08  0.98  0.75  0.53

Test the null hypothesis

H0 : all four groups have equal means.

Qn: What classical hypothesis test would you use?

A conventional F test (one-way ANOVA) could be used: the ratio of the between-group sum of squares to the within-group sum of squares is compared with the F3,14 distribution. The p-value for the observed F statistic is 0.08.


Alternatively, a randomisation test could be applied:

1. Randomly re-assign the observations to the four treatments, keeping the numbers in each treatment the same.

2. Evaluate the test statistic

\[ F = \frac{\left(\sum_{i=1}^{4} n_i (\bar{x}_i - \bar{x})^2\right)/3}{\left(\sum_{i=1}^{4} \sum_{j=1}^{n_i} (x_{i,j} - \bar{x}_i)^2\right)/14} \]

for the permuted data.

3. Repeat steps 1 and 2 B times to obtain sampled test statistics F1, . . . , FB.

4. Estimate the significance level of Fobs by

\[ \frac{1}{B} \sum_{i=1}^{B} I_{F_i \geq F_{obs}}. \]

Based on a sample of size B = 10000, the estimated p-value for Fobs was 0.03, suggesting slightly stronger evidence against the null hypothesis (compared with the parametric test).


Example 4: One-sample randomisation tests

Randomisation tests can be used for one-sample problems, but under stricter assumptions. This is demonstrated with the following example.

Given observations

{10.61, 9.46, 7.02, 11.68, 9.58, 11.96, 11.28, 7.63, 6.42, 8.85}

drawn from some population with mean µ, test the null hypothesis

H0 : µ = 10.

It is not immediately obvious what can be permuted here. However, suppose the following two assumptions hold:

• Each observation has been sampled randomly from its population.

• The population distribution is symmetric about its mean.


Now suppose H0 is true, and consider randomly sampling a value X from the population, and then evaluating Y = X − 10. If the population distribution is symmetric about 10, then Y must have an equal probability of being positive or negative.

In this example, subtracting 10 from each observation and taking the resulting mean gives a sample mean of −0.551. We will use the absolute value of this sample mean as the test statistic,

\[ T = \left| \frac{1}{n} \sum_{i=1}^{n} Y_i \right|, \]

so Tobs = 0.551 (for a two-sided test).

If H0 is true, and both assumptions hold, then the observed sample mean could simply be due to an imbalance of positive and negative Y values. This can be tested as follows:


Fisher's Randomisation test

1. Subtract the hypothesised population mean from each observation:

{0.61, −0.54, −2.98, 1.68, −0.42, 1.96, 1.28, −2.37, −3.58, −1.15}

2. Calculate the observed test statistic: Tobs = 0.551.

3. With probability 0.5 for each observation, change the sign of X − µ, e.g.

{−0.61, −0.54, −2.98, −1.68, 0.42, 1.96, −1.28, −2.37, −3.58, 1.15}

4. Re-calculate the test statistic for the new simulated observations: T = 0.951.

5. Repeat 3 and 4 to obtain B sampled test statistics T1, . . . , TB.

6. Estimate the significance of Tobs by

\[ \frac{1}{B} \sum_{i=1}^{B} I_{|T_i| \geq |T_{obs}|}. \]
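A sketch of this test in R:

y <- c(10.61, 9.46, 7.02, 11.68, 9.58, 11.96, 11.28, 7.63, 6.42, 8.85) - 10
t.obs <- abs(mean(y))

B <- 10000
t.sim <- replicate(B, abs(mean(y*sample(c(-1, 1), 10, replace = TRUE))))
mean(t.sim >= t.obs)   # estimated significance of Tobs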


With B = 10000, the estimated significance of Tobs is 0.4021.

Using a conventional t-test, the significance of Tobs is 0.3982, so there is close agreement between the two methods in this example.


Example 5

Two treatments A and B, with unknown population means µA and µB.

treatment A  130 119 119 168 130
treatment B  154 115 169 137 186

Consider H0 : µB − µA = 20.

How would we test this using a randomisation test?

We cannot just permute the data and evaluate the difference between the sample means, as the population means are not equal under H0.


Suppose we were to add 20 to each observation in group A. Under H0, what is the expectation of {20 + a response in group A}?

If H0 is true, then this expectation is 20 + µA = µB.

Adding 20 to each response in group A:

treatment A+20  150 139 139 188 150
treatment B     154 115 169 137 186

Under H0 the groups now have equal population means, and we can use a randomisation test in the usual way.


Summary of randomisation tests

Some argue that randomisation tests should always be used, as samples of data are never truly randomly drawn from the population of interest; some members of the population are always going to be more accessible than others.

On the other side, there is no theory to show that the results of a randomisation test can be generalised to the whole population; evidence against the null hypothesis is obtained for the observed sample only.

Consequently, in either case, a 'non-statistical' judgement has to be made: that the sample can be treated as effectively random for a conventional test, or that the results can be generalised to the population for a randomisation test.


Two advantages of randomisation tests are that they can be used for any test statistic (i.e. in cases when it is not possible to analytically derive the distribution of the test statistic), and that we don't have to assume a particular distribution for the data.

Note that in most of the examples, almost identical results were obtained using the two methods. In this case, the randomisation test could be seen as a means of supporting the results from the parametric test.

The requirement for the randomisation test to be valid is that the subjects are assigned randomly to each treatment. If random allocation is not explicitly part of the experimental procedure then there needs to be the belief that the actual allocation was as likely to occur as any other.


2.3 Bootstrapping

The bootstrap is a method for assessing properties of a statistical estimator in a non-parametric framework. That is, we do not assume that the data are obtained from any parametric distribution (e.g. normal, exponential, etc.).

The bootstrap is usually used to assess the variance of a statistical estimator, but it is not exclusively used for this purpose.

The name comes from the story 'The Surprising Adventures of Baron Munchausen', where the main character pulls himself out of a swamp by pulling on his own bootstraps.

The idea behind bootstrapping is that we can use the data multiple times to generate 'new' data sets to assess the properties of our estimators.


Recap: CDFs


2.3.1 The Empirical Distribution Function

Define

\[ \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I_{X_i \leq x} \]

to be the empirical distribution function (edf) for data {X1, . . . , Xn}.

• F̂ takes values in {0, 1/n, . . . , n/n}.

• To sample from F̂ we sample WITH REPLACEMENT from {X1, . . . , Xn}.

Note that F̂ is a random quantity. We consider the edf to be an estimator for F. If the Xi are all from distribution F then the following results hold.


Properties of the EDF - I

1. F̂(x) is an unbiased estimator of F(x):

\[ E\hat{F}(x) = F(x). \]

Proof:

\[ E\hat{F}(x) = \frac{1}{n} \sum E I_{X_i \leq x} = \frac{1}{n} \sum P(X_i \leq x) = \frac{1}{n} \sum F(x) = F(x). \]


Properties of the EDF - II

2. F̂(x) → F(x) as n → ∞, with probability 1.

Proof: F̂(x) = (1/n) ∑ Yi, where Yi = I_{Xi ≤ x}; the Yi are iid rvs with EYi = F(x) and Var(Yi) = F(x)(1 − F(x)) < ∞. Therefore the strong law of large numbers says

\[ \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I_{X_i \leq x} \to F(x) \]

as n → ∞, with probability 1.


Properties of the EDF - III

3. \[ \frac{\sqrt{n}(\hat{F}(x) - F(x))}{\sqrt{F(x)(1 - F(x))}} \to N(0, 1) \ \text{in distribution} \]

as n → ∞.

Proof: By the central limit theorem.


Properties of the EDF - IV

4. If the Xi are an independent identically distributed sequence (so it doesn't matter if we change the order), then knowledge of F̂ is equivalent to knowledge of {X1, . . . , Xn}.


Example of the EDF

Suppose Xi ∼ Cauchy. Then we can examine the edf for increasing values of n.
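In R the edf can be computed and plotted with ecdf(); a sketch that reproduces one panel of the figures below:

x <- rcauchy(50)
plot(ecdf(x), main = "EDF n=50")        # the edf of the sample
curve(pcauchy(x), add = TRUE, lty = 2)  # the true F, for comparison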

[Figure: edfs of Cauchy samples, for n = 10 (two realisations) and n = 50]

Example of the EDF - II

[Figure: edfs of Cauchy samples, for n = 100 and n = 1000]

Notice that we've repeated the experiment for n = 10 and that the edf is different each time. Notice also that the edf becomes more accurate as n gets larger.

Parameters, statistics and properties

Bootstrapping texts can sometimes be confusing because of the language usage. We say:

• θ is a parameter if it is a property of the underlying population, i.e., θ = θ(F).

• θ̂ is a statistic which estimates θ if θ̂ is a function of the sample X1, . . . , Xn; this is equivalent to being a function of the empirical distribution function:

\[ \hat{\theta} = \theta(\hat{F}) \equiv \hat{\theta}(X). \]

• We then talk about properties of θ̂, such as its bias, its expectation, its standard error, etc. For bootstrapping applications these properties are usually sampling properties; that is, if we repeatedly collected similar samples, what properties would θ̂ have?

Difficulties arise when we note that properties are also parameters of the statistic θ̂, and that we estimate them with statistics of the statistics.


Plug-in Principle

For example, suppose we have a sample of size n, {X1, . . . , Xn} say, from an unknown distribution F.

Suppose that interest lies in some parameter θ of the distribution F, which we write θ = θ(F), where we consider θ to be a functional of the distribution F.

We estimate θ by θ̂, where θ̂ is a function of the observations {X1, . . . , Xn}. Usually we have that θ̂ = θ(F̂); that is, if we apply the functional θ(·) to the edf F̂ we get the statistic θ̂.

The parameter θ and the statistic θ̂ are both found by using the functional θ(·): for the parameter we have θ = θ(F), and for the statistic we have θ̂ = θ(F̂).

This is what we call the plug-in principle. To estimate the parameter θ = θ(F) when we don't know F, we plug in the empirical distribution function F̂ to find the estimator θ̂ = θ(F̂).


Examples of parameters and the plug-in principle

1. Population mean:

\[ \theta = \theta(F) = E_F X = \int x \, dF(x) = \int x f(x) \, dx. \]

Using the plug-in principle,

\[ \hat{\theta} = \theta(\hat{F}) = \int x \, d\hat{F}(x) = \frac{1}{n} \sum \int x \, \delta(x - X_i) \, dx = \frac{1}{n} \sum X_i, \]

which is the sample mean, the usual estimator of the population mean.


Here δ(x) is the Dirac delta function, which is defined by its behaviour under integration:

\[ \int_A \delta(x - a) \, dx = \begin{cases} 1 & \text{if } a \in A, \\ 0 & \text{if } a \notin A. \end{cases} \]

The delta function δ(x − a) is the derivative of the indicator function I_{x ≤ a}.


Examples of parameters and the plug-in principle

2. Population variance:

\[ \theta = \theta(F) = \mathrm{Var}_F(X) = \int (x - E_F(X))^2 \, dF(x). \]

Using the plug-in principle we find that the statistic θ̂ is

\[ \hat{\theta} = \theta(\hat{F}) = \int (x - \bar{X})^2 \, d\hat{F}(x) = \frac{1}{n} \sum \int (x - \bar{X})^2 \delta(x - X_i) \, dx = \frac{1}{n} \sum (X_i - \bar{X})^2. \]

This is not quite the usual estimator of the variance, \( \frac{1}{n-1} \sum (X_i - \bar{X})^2 \), as usually we multiply this value by n/(n − 1) to get an unbiased estimator; θ̂ is biased in this case.


Examples of parameters and the plug-in principle

3. Probability:

\[ \theta = P_F(X > c) = \int_c^{\infty} dF(x), \]

which is estimated by the statistic

\[ \hat{\theta} = \frac{1}{n} \sum I_{X_i > c}. \]


2.3.2 Estimating sampling properties with the bootstrap

For our estimates to be of any value, it is necessary to know their properties, such as the bias or the standard error.

• The bias is defined as

\[ \mathrm{bias}(\hat{\theta}) = E\hat{\theta} - \theta. \]

• The standard error is

\[ \mathrm{se}(\hat{\theta}) = \sqrt{E(\hat{\theta} - E\hat{\theta})^2}. \]


If we believed that the Xi were from a specific parametric model,

e.g. F = Φ so that Xi ∼ N(µ, σ²),

then we could calculate the bias and standard error analytically. If these calculations were difficult or impossible (for example, if θ̂ is a trimmed mean) then we can use simulations from F to estimate the standard error and bias of the statistic.

What if we don't have a parametric model for F?

The bootstrap can be used to estimate the sampling distribution in this case.

The idea is that instead of sampling from the population of interest, i.e. from F(·), we instead sample with replacement from the sample {x1, . . . , xn}, i.e. from F̂(·).


Example 1: Heart-attack study

A controlled, randomized, double-blind study was carried out to investigate whether or not aspirin reduces the risk of heart attacks in healthy middle-aged men. The data from the study are:

          heart attacks (fatal plus non-fatal)   subjects
aspirin   104                                    11037
placebo   189                                    11034


Heart-attack study - II

Define θ to be the true ratio of proportions of heart attacks in those with aspirin to those with a placebo, the relative risk.

From the data, the estimate of θ suggests that aspirin lowers the risk of a heart attack:

\[ \hat{\theta} = \frac{104/11037}{189/11034} = 0.55. \]

But how confident can we be? Can we calculate a confidence interval for θ?

It is possible to derive a parametric confidence interval for θ from theory by assuming that the log relative risk is normally distributed. But what if we've forgotten how, or don't wish to assume normality?


Heart-attack study - III

Bootstrapping enables us to derive these intervals without assuming that the log relative risks are normally distributed:

1. Estimate the probability p1 of a patient with aspirin having a heart attack:

\[ \hat{p}_1 = \frac{104}{11037} = 0.00942. \]

2. Estimate the probability p2 of a patient with a placebo having a heart attack:

\[ \hat{p}_2 = \frac{189}{11034} = 0.0171. \]

3. Simulate data for a new experiment: sample r1 from Binomial(11037, 0.00942) and r2 from Binomial(11034, 0.0171). The new data are known as a bootstrap sample.

4. Obtain a new estimate of the ratio:

\[ \theta^*_s = \frac{r_1/11037}{r_2/11034}. \]

Heart-attack study - IV

Steps 3 and 4 are then repeated a large number of times, to obtain a sample

\[ \{\theta^*_1, \ldots, \theta^*_B\}. \]

We can then use the 2.5th and 97.5th percentiles of this sample as a 95% confidence interval for θ.

With B = 10000, performing this procedure in R gave a 95% interval of (0.43, 0.69) for θ.
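A sketch of this procedure in R (this is in fact a parametric bootstrap, a variant returned to later in this chapter):

B <- 10000
r1 <- rbinom(B, size = 11037, prob = 104/11037)  # aspirin heart attacks
r2 <- rbinom(B, size = 11034, prob = 189/11034)  # placebo heart attacks
theta.star <- (r1/11037)/(r2/11034)
quantile(theta.star, c(0.025, 0.975))            # approximate 95% CI for theta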

We will now formally introduce the bootstrap and look at some examples in detail.


The bootstrap

The basic idea behind the bootstrap is to find properties of the statistic θ̂ by resampling from F̂ (rather than F).

• If we could generate from F, we could simulate sample data sets {X^{(i)}_1, . . . , X^{(i)}_n} for i = 1, . . . , B and find θ̂^{(i)}. We could then learn the properties of θ̂ from {θ̂^{(1)}, . . . , θ̂^{(B)}}.

But usually we don't know F and so can't produce these samples. Instead we can bootstrap. This involves two ideas:

(i) Replace F by F̂.

(ii) Sample from F̂ and find the properties of θ̂ under F̂.


The bootstrap

The bootstrap algorithm

1. Generate B bootstrap replicates from F̂:

\[ X^{*(i)} = \{X^{*(i)}_1, \ldots, X^{*(i)}_n\} \quad \text{for } i = 1, \ldots, B. \]

2. Calculate B bootstrap parameter estimates

\[ \hat{\theta}^*_1, \ldots, \hat{\theta}^*_B. \]

3. Calculate the property of interest for θ̂ from {θ̂*_i}, e.g.

\[ \mathrm{se}_{boot}(\hat{\theta}) = \sqrt{E_{\hat{F}}\big(\hat{\theta}^* - E_{\hat{F}}\hat{\theta}^*\big)^2} \approx \sqrt{\frac{1}{B-1} \sum (\hat{\theta}^*_i - \bar{\theta}^*)^2}, \]

where \( \bar{\theta}^* = \frac{1}{B} \sum \hat{\theta}^*_i \).


The bootstrap

We call iid samples of size n from F̂ bootstrap replicates.

They can be generated by sampling with replacement from {x1, . . . , xn}.

In R this can be achieved using the command

sample(x, size=n, replace=TRUE)


The bootstrap estimate of standard error

Suppose θ̂(X) is some statistic based on X = {x1, . . . , xn} used for estimating the parameter θ. The standard error of θ̂(X) is

\[ \mathrm{se}(\hat{\theta}) = \sqrt{\mathrm{Var}_F(\hat{\theta}(X))}. \]

Here the variance is with respect to the distribution F. The bootstrap estimate is found by

(i) replacing F with F̂:

\[ \mathrm{se}_F(\hat{\theta}) \overset{O_p(1/\sqrt{n})}{\approx} \mathrm{se}_{\hat{F}}(\hat{\theta}); \]

(ii) approximating se_F̂ using simulation:

\[ \mathrm{se}_{\hat{F}}(\hat{\theta}) \overset{O_p(1/\sqrt{B})}{\approx} \left( \frac{1}{B-1} \sum_{b=1}^{B} \big(\hat{\theta}(X^{*(b)}) - \bar{\theta}^*\big)^2 \right)^{1/2} =: \mathrm{se}_{boot}, \]

where X^{*(b)} = {X^{*(b)}_1, . . . , X^{*(b)}_n} is a bootstrap sample from {x1, . . . , xn}; i.e., se²_boot is the variance of θ̂(X*) when X* is drawn from F̂.

The bootstrap estimate of standard error - II

To make this more explicit, note that the first step is using the plug-in principle again.

If we consider the variance of θ̂ to be a functional of F,

\[ \mathrm{Var}(\hat{\theta})[F] = E_F(\hat{\theta} - E_F(\hat{\theta}))^2, \]

then when we plug in F̂ we find

\[ \mathrm{Var}_{\hat{F}}(\hat{\theta}) = E_{\hat{F}}(\hat{\theta} - E_{\hat{F}}\hat{\theta})^2. \]


The bootstrap estimate of standard error - III

The second step is then to estimate se_boot by simulation, replacing Var_F̂(θ̂(X*)) with an estimate:

\[ \mathrm{Var}_{\hat{F}}(\hat{\theta}(X^*)) \approx \mathrm{Var}_{boot}(\hat{\theta}(X^*)) = \frac{1}{B-1} \sum_{b=1}^{B} \big(\hat{\theta}(X^{*(b)}) - \bar{\theta}^*\big)^2, \]

where

\[ \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}(X^{*(b)}) \]

and where X^{*(b)} = {X^{*(b)}_1, . . . , X^{*(b)}_n} are B bootstrap replicates from F̂.


Bootstrap estimate of bias

The bias of an estimator θ̂ of the parameter θ is defined as

\[ \mathrm{bias} = E_F(\hat{\theta}) - \theta, \]

i.e., how the mean of the estimator of θ differs from the true value of θ.

An estimate is found by replacing F by F̂:

\[ \mathrm{bias}_{\hat{F}} = E_{\hat{F}}(\hat{\theta}^*) - \hat{\theta}. \]

That is, the difference between the bootstrap expected value and the estimated value. This, again, is the plug-in principle.


Bootstrap estimate of bias - II

We can estimate E_F̂(θ̂*) from bootstrap samples as

\[ E_{\hat{F}}(\hat{\theta}^*) \approx \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*(b)}, \]

giving the bootstrap estimate of the bias as

\[ \mathrm{bias}_{\hat{F}}(\hat{\theta}) \approx \mathrm{bias}_{boot}(\hat{\theta}) = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*(b)} - \hat{\theta}. \]


So in general there are two approximation steps in the bootstrap procedure:

1. Replace F by F̂.

2. Simulate from F̂ to form the estimate of the property of interest.

The error in the first approximation scales with the number of data points (and so is fixed for any given problem). The error in the second approximation scales with B, the number of bootstrap replicates, and so can be controlled:

\[ \mathrm{se}_F(\hat{\theta}) \overset{O_p(1/\sqrt{n})}{\approx} \mathrm{se}_{\hat{F}}(\hat{\theta}) \overset{O_p(1/\sqrt{B})}{\approx} \mathrm{se}_{boot}(\hat{\theta}). \]

Here Yn = Op(xn) means that Yn/xn is stochastically bounded, i.e., for any ε > 0 there exists M > 0 such that for all n,

\[ P(|Y_n/x_n| > M) < \varepsilon. \]


Lawschool example

A sample of 15 law schools was taken, and two measurements were made for each school:

xi : LSAT, the average score for the class on a national law test
yi : GPA, the average undergraduate grade-point average for the class

We are interested in the correlation coefficient between these two quantities, which we estimate to be θ̂ = 0.776.

[Figure: scatter plot of GPA against LSAT for the 15 law schools]

Lawschool example - II

But how accurate is this estimate of the correlation coefficient? We use the bootstrap to estimate the standard error of θ̂ = cor(LSAT, GPA).

1. Sample 15 data points with replacement from the observed data z = {(x1, y1), . . . , (x15, y15)} to obtain new data z*.

2. Evaluate the sample correlation coefficient θ̂* for the newly sampled data z*.

3. Repeat steps 1 and 2 to obtain θ̂*(1), . . . , θ̂*(B).

4. Estimate the standard error of the sample correlation coefficient by the sample standard deviation of θ̂*(1), . . . , θ̂*(B).
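A sketch in R, assuming the data are stored in a data frame law with columns LSAT and GPA:

B <- 1000
theta.star <- replicate(B, {
  idx <- sample(15, replace = TRUE)  # bootstrap replicate of the schools
  cor(law$LSAT[idx], law$GPA[idx])
})
sd(theta.star)                       # bootstrap estimate of se(theta-hat)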


Lawschool example - III

With B = 1000, I found the estimated standard error of θ̂ to be 0.137. It can help to plot a histogram of the bootstrap replicates, as this gives more information about the distribution of θ̂.

[Figure: histogram of the bootstrap replicates θ̂*]

Bootstrap confidence interval

We have two methods of calculating CIs.

1. Normal interval. Given an estimate of the standard error of θ̂, if we assume that the distribution of θ̂ is approximately normal, then an approximate 95% confidence interval is given by

\[ \hat{\theta} \pm 1.96 \, \mathrm{se}(\hat{\theta}^*). \]

• For the law dataset we find a 95% CI for cor(LSAT, GPA) of [0.51, 1.04] ≡ [0.51, 1.00] (truncated at 1, as a correlation cannot exceed 1).

This interval is not accurate unless the distribution of the bootstrap samples is approximately normal.


Bootstrap confidence interval - II

2. Percentile confidence interval. For a 95% confidence interval, we need to find the two values l and u with

\[ P(\theta^* < l) = 0.025, \qquad P(\theta^* > u) = 0.025; \]

i.e., we need to identify the 2.5th and 97.5th percentiles of the distribution of θ*. We can find these from the 2.5th and 97.5th percentiles of the sample {θ̂*(1), . . . , θ̂*(B)}. We generally need a larger value of B to get accurate percentile estimates than is required to find an accurate estimate of the standard error.

• For the law dataset we find a 95% CI of [0.45, 0.96].
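Continuing the sketch above, the two intervals can be computed as:

theta.hat <- cor(law$LSAT, law$GPA)
theta.hat + c(-1.96, 1.96)*sd(theta.star)  # normal interval
quantile(theta.star, c(0.025, 0.975))      # percentile interval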


Hypothesis testing with the bootstrap

Example: mice survival times

Treatment  94 197  16  38  99 141  23
Control    52 104 146  10  50  31  40  27  46

Is there a difference between the two group means?

• Denote the 7 treatment observations by x = {x1, . . . , x7}, and the 9 control observations by y = {y1, . . . , y9}.

• We could perform a two-sample t test, assuming normally distributed responses and equal variances in the two groups.

• Define µX : population treatment mean, and µY : population control mean. For a one-sided test of H0 : µX = µY, the observed p-value is 0.1405.


Bootstrap hypothesis test

• An alternative to assuming normality.

• Denote by FX the distribution of treatment survival times, and by FY the distribution of control survival times.

• Write the null hypothesis as H0 : FX = FY = F, with F the single common distribution of all the responses.

• Estimate F by F̂, the empirical cdf of all 16 observations.


Bootstrap two-sample significance test

1. Sample 16 values with replacement from {x1, . . . , x7, y1, . . . , y9}.

2. Set {x*1, . . . , x*7} to be the first 7 sampled values, and {y*1, . . . , y*9} to be the remaining 9 sampled values.

3. Calculate the bootstrap test statistic

\[ T^* = \frac{\bar{x}^* - \bar{y}^*}{\hat{\sigma}^* \sqrt{1/7 + 1/9}} \]

for the sampled data.

4. Repeat steps 1 to 3 B times to obtain T*(1), . . . , T*(B).

5. Estimate the significance of the observed Tobs by

\[ \frac{1}{B} \sum_{i=1}^{B} I\{T^{*(i)} \geq T_{obs}\}. \qquad (5) \]
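A sketch of this test in R for the mice data, with the pooled standard deviation computed explicitly:

x <- c(94, 197, 16, 38, 99, 141, 23)           # treatment
y <- c(52, 104, 146, 10, 50, 31, 40, 27, 46)   # control
pooled <- c(x, y)

t.stat <- function(x, y) {
  s <- sqrt((6*var(x) + 8*var(y))/14)          # pooled estimate of sd
  (mean(x) - mean(y))/(s*sqrt(1/7 + 1/9))
}
t.obs <- t.stat(x, y)

B <- 10000
t.star <- replicate(B, {
  z <- sample(pooled, 16, replace = TRUE)      # sample from F-hat
  t.stat(z[1:7], z[8:16])
})
mean(t.star >= t.obs)                          # estimated significance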


The Parametric bootstrap

Thus far we have been using the non-parametric bootstrap, i.e., we

• sample from F̂, making no assumptions about the distribution of the data.

The parametric bootstrap can be used when we believe F = Fθ, i.e. we have a parametric model for the data. Then instead of sampling from F̂, we sample from F_θ̂.

In the mice example, we would replace step 1 of the two-sample significance test above by:

1a. Estimate the population mean µ and variance σ², e.g.,

\[ \hat{\mu} = \frac{1}{16}\left(\sum x_i + \sum y_i\right). \]

1b. Sample 16 values from a N(µ̂, σ̂²) distribution.

Steps 2-5 remain unchanged.
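As a sketch, steps 1a and 1b in R (reusing pooled = c(x, y) from the mice sketch above):

mu.hat    <- mean(pooled)
sigma.hat <- sd(pooled)
z <- rnorm(16, mean = mu.hat, sd = sigma.hat)  # parametric bootstrap sample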


The Bootstrap and Regression

A formal regression-type model has the structure

\[ y_i = f(x_i, \beta) + \varepsilon_i, \]

where f is a specified function acting on the covariates xi with parameters β, and εi is a realisation from a specified error structure. With this model framework, there are two alternative ways to bootstrap the model:

1. Fit the regression model, form the empirical distribution of the residuals, generate bootstrap replications of the data by substituting these back into the model, and re-fit the model to obtain bootstrap distributions of β̂. This is called model-based resampling.

2. Bootstrap from the pairs (xi, yi), re-fit the model to each realisation, and form the bootstrap distribution of β̂.


Model-based resampling

We will fit a model of the form

\[ GPA_i = \beta_0 + \beta_1 LSAT_i + \varepsilon_i \]

to the law data. A least squares fit to these data gives β̂0 = 0.3794 and β̂1 = 0.0045, but how accurate are these values? We can perform the following steps to find the standard errors of these estimates.


Model-based resampling - II

1. Find the fitted residuals

\[ \hat{\varepsilon}_i = GPA_i - \hat{\beta}_0 - \hat{\beta}_1 LSAT_i. \]

2. Sample ε*1, . . . , ε*15 with replacement from {ε̂1, . . . , ε̂15}.

3. Set GPA*i = β̂0 + β̂1 LSATi + ε*i.

4. Fit the least squares regression to {(LSAT1, GPA*1), . . . , (LSAT15, GPA*15)} to find estimates β̂* = (β̂*0, β̂*1).

5. Repeat steps 2 to 4 B times to find bootstrap replicates

\[ \{(\hat{\beta}_0^{*(1)}, \hat{\beta}_1^{*(1)}), \ldots, (\hat{\beta}_0^{*(B)}, \hat{\beta}_1^{*(B)})\} \]

and use these replicates to estimate se(β̂0) and se(β̂1).
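A sketch in R, again assuming the law data frame:

fit <- lm(GPA ~ LSAT, data = law)
res <- resid(fit)

B <- 1000
beta.star <- replicate(B, {
  gpa.star <- fitted(fit) + sample(res, replace = TRUE)  # bootstrap responses
  coef(lm(gpa.star ~ law$LSAT))                          # re-fitted estimates
})
apply(beta.star, 1, sd)   # bootstrap standard errors of the two coefficients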


Model-based resampling - III

Using 1000 bootstrap replicates, I find the standard errors are

se(β̂0) = 0.586,  se(β̂1) = 0.000973,

and the plot below shows a sample of 20 bootstrap regression lines.

[Figure: the law school data with 20 bootstrap regression lines]

Motorcycle example

We consider data providing measurements of acceleration against time for a simulated motorcycle accident. The data are shown in the figure below.

[Figure: acceleration against time for the simulated motorcycle accident data]

Motorcycle example

Clearly the relationship is nonlinear, and has structure that will not easily be modelled parametrically. We use the loess() command in R to fit a locally weighted least-squares regression line to the data. (The details aren't important for this course, but for completeness' sake we set span=1/3, which determines the proportion of the data included in the moving window specifying which points are regressed upon.) The figure shows the best fit.

Because of the non-parametric structure of the model, classical approaches to the assessment of parameter precision are not available. We can get a sense of how accurate the parameters are by using a bootstrapping scheme (of the second type, bootstrapping pairs). This is achieved by simply bootstrapping the pairs (x, y) in the original plot and fitting loess curves to each simulated bootstrap series. A figure showing 20 bootstrap samples is shown below.


Motorcycle example

[Figure: 20 bootstrap loess fits to the motorcycle data]

R code is available in motorcycle.txt.

Summary

1. Monte Carlo tests
• Will work with any test statistic and hypothesis, but requires specification of the distribution of the data under the null hypothesis.
• The only procedure of the three that produces 'completely new' data.

2. Randomisation tests
• Can generally only handle tests of no treatment effect between different treatment groups. One-sample tests can be performed, but under stricter assumptions.
• No distribution is required/assumed for the data, only that the allocation of subjects to treatment groups is random.


3. Bootstrapping
• Arguably the most widely applicable method of the three.
• Its main use is to construct confidence intervals.
• Dependent on the empirical cdf being a good approximation to the true distribution.
• Accuracy ultimately depends on the size of the original sample.


2.4 Prediction errors and cross-validation

We fit models by minimizing some measure of error; e.g., we fit a linear model by minimizing the sum of squares

\[ S(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2. \]

When choosing between competing models we might be tempted to take the model that achieves the lowest error rate on the training data.

However, the error we achieve on the training data is not the same as the error we expect when predicting new data.

We need to be careful when choosing between models not to over-fit and choose a model that is too complex.


Example: Over-fitting

Suppose we are given data {(x1, y1), . . . , (xn, yn)}

[Figure: scatter plot of the data]

and we want to choose between the models

\[ M_1 : y = \beta_0 + \beta_1 x + \varepsilon, \]
\[ M_2 : y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon, \]
\[ \vdots \]
\[ M_d : y = \beta_0 + \beta_1 x + \ldots + \beta_d x^d + \varepsilon. \]


Example: Over-fitting

[Figure: the fitted curves for M1 (linear), M2 (quadratic), and M9 (9th order)]

The plot shows the fitted curves for M1, M2, and M9. The residual sum of squares is 676 (M1), 590 (M2), and 0 (M9).

"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (John von Neumann)


Example: Over-fitting

M9 is a perfect fit to the training data: the residual sum of squares is 0.

• With n data points, we can always find a polynomial of degree n − 1 that fits perfectly.

But M9 is over-fit: it is modelling the noise, not the signal, and would fail to accurately predict new data.

We know in general that fitting high order polynomials to regression data is not a sensible thing to do, but how can we demonstrate this?

• Some methods adjust the training error to account for model complexity, e.g., AIC, BIC, the Cp statistic.

Alternatively, in data-rich environments, we can simply split the data into a training set and a test set. We fit the model on the training set, and then test its predictive accuracy on the test set.


Training vs test set performance

Making a model more complex will always result in a better fit to the training data. But there is a bias-variance trade-off:

• bias occurs from errors in the model structure, i.e., from models that are too simplistic;

• variance occurs from needing to estimate parameters: for complex models with many parameters, fitting can be sensitive to small fluctuations in the training set, leading us to fit the noise rather than the signal.


Cross-validation

Cross-validation is a means of efficiently assessing predictive accuracy, and extends the idea of having test and training datasets.

Leave-one-out cross-validation (LOO-CV). For i = 1, . . . , n:

1. Fit the model to the reduced data set (or training set)

\[ \{(x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n)\}. \]

2. Obtain from the fitted model the predicted value ŷi at xi.

3. Compute the squared error ε̂²_i = (yi − ŷi)².

An average squared prediction error can then be reported as

\[ \frac{1}{n} \sum_{i=1}^{n} \hat{\varepsilon}_i^2, \]

or the root-mean-square (rms) prediction error as

\[ \sqrt{\frac{1}{n} \sum_{i=1}^{n} \hat{\varepsilon}_i^2}. \]

All predictions are on held-out data (test data), and so this gives us a measure of a model's predictive skill.
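A sketch of LOO-CV for a simple linear model in R, assuming data vectors x and y:

loo.rmse <- function(x, y) {
  n <- length(y)
  err <- numeric(n)
  for (i in 1:n) {
    fit <- lm(y[-i] ~ x[-i])                             # fit without point i
    err[i] <- y[i] - (coef(fit)[1] + coef(fit)[2]*x[i])  # held-out error
  }
  sqrt(mean(err^2))                                      # rms prediction error
}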


k-fold cross-validation

Note that this is not the expected prediction error of the actual model (as we have only fit to n − 1 data points), though it should be close if n is sufficiently large (so that the fit to n − 1 points is very similar to the fit to n points).

This approach left out one point at a time, and is called leave-one-out cross-validation.

K-fold cross-validation splits the data into K chunks of approximately equal size. Then for k = 1, . . . , K:

• Delete chunk k from the data.

• Fit the model to the rest of the data.

• Use the fitted model to predict the data in chunk k and compute the prediction error.

Setting K = n gives leave-one-out cross-validation! A sketch is given below.
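A sketch of K-fold cross-validation for the same simple linear model, with random fold assignment:

kfold.mse <- function(x, y, K = 10) {
  n <- length(y)
  fold <- sample(rep(1:K, length.out = n))  # random fold assignment
  err <- numeric(n)
  for (k in 1:K) {
    test <- which(fold == k)
    fit  <- lm(y[-test] ~ x[-test])
    err[test] <- y[test] - (coef(fit)[1] + coef(fit)[2]*x[test])
  }
  mean(err^2)                               # average squared prediction error
}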


If we do LOO-CV for our example, then we can plot the predicted value against the true observed value for different models. A perfect model would have predictions on the line y = x.

[Figure: LOO-CV predicted values against observed values, for the linear model and the 6th order polynomial, with the line y = x (perfect prediction)]

We can see that linear regression is much better in terms of predictive performance than the 6th order polynomial.

The mean square prediction error for the straight line is 1065.8, whereas for the 6th order polynomial it is 1936.


What K should we use in K-fold cross-validation?

There is a variance-bias trade-off here too!

• The variance of our estimate of the predictive error grows as K gets larger.
  • For large K, e.g. LOO-CV with K = n, the data don't typically get shaken up enough. In LOO-CV each fold only differs by two data points, and so the estimates from each fold are highly correlated. Hence our estimate of the average prediction error has a high variance (i.e. is unreliable).
  • For small K, the folds are very different, so the error estimates are less correlated, and we get a stable estimate.

• The bias of our estimate of the predictive error shrinks as K gets larger.
  • Since each training set contains only (K−1)n/K data points, rather than n, the estimate of the prediction error is usually biased upwards (i.e. is too large).
  • The bias is minimized for K = n, but this has high variance.

K = 5 or K = 10 are both usually considered good choices, but it can vary between applications.

The cvTools package in R can be used to do cross-validation.
