Chapter 11
Simulation studies in statistics
Introduction
• What is a simulation study, and why do one?
• Simulations for properties of estimators
• Simulations for properties of hypothesis tests
What is a simulation study?
Simulation
• A numerical technique for conducting experiments on the computer.
• It involves random sampling from probability distributions.
Rationale (in statistics)
• Properties of statistical methods must be established so that the
methods may be used with confidence.
• Exact analytical derivations of properties are rarely possible.
• Large sample approximations to properties are often possible.
• But what happens in the finite sample size case?
• And what happens when assumptions are violated?
Usual issues
• Is an estimator biased in finite samples? Is it still consistent under
departures from assumptions? What is its sampling variance?
• How does it compare to competing estimators on the basis of bias,
precision, etc.?
• Does a procedure for constructing a confidence interval for a
parameter achieve the advertised nominal level of coverage?
Usual issues
• Does a hypothesis testing procedure attain the advertised level of
significance or size?
• If it does, what power is possible against different alternatives to
the null hypothesis? Do different test procedures deliver different
power?
How to answer these questions in the absence of analytical results?
Monte Carlo simulation approximation
• An estimator or test statistic has a true sampling distribution
under a particular set of conditions (finite sample size, true
distribution of the data, etc.)
• Ideally, we would want to know this true sampling distribution in
order to address the issues on the previous slide.
• But deriving the true sampling distribution is often not tractable.
• Hence we approximate the sampling distribution of an estimator or
a test statistic under a particular set of conditions.
How to approximate
A typical Monte Carlo simulation involves the following:
• Generate S independent data sets under the conditions of interest
• Compute the numerical value of the estimator/test statistic T from
each data set. Hence we have T1, T2, · · · , Ts, · · · , TS
How to approximate
• If S is large enough, summary statistics across T1, T2, · · · , Ts, · · · , TS
should be good approximations to the true properties of the
estimator/test statistic under the conditions of interest.
An example:
Consider an estimator for a parameter θ:
– Ts is the value of T from the s-th data set, s = 1, 2, · · · , S.
– The mean of all the Ts's is an estimate of the true mean of the
sampling distribution of the estimator.
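The generic recipe above can be sketched in a few lines of R. This is a minimal illustration only; the distribution (N(0,1)), sample size (n = 15), and estimator (the sample mean) are placeholder choices, not the ones used in the study that follows.

```r
# Generic Monte Carlo recipe: approximate the sampling distribution
# of an estimator T under chosen conditions.
set.seed(1)       # for reproducibility
S <- 1000         # number of simulated data sets
n <- 15           # sample size for each data set (illustrative choice)
Ts <- numeric(S)  # storage for T1, T2, ..., TS

for (s in 1:S) {
  y <- rnorm(n, mean = 0, sd = 1)  # generate one data set (illustrative: N(0,1))
  Ts[s] <- mean(y)                 # compute the estimator (illustrative: sample mean)
}

mean(Ts)  # approximates the true mean of T's sampling distribution
sd(Ts)    # approximates its true standard deviation (here about 1/sqrt(15))
```

Summary statistics of the stored values T1, ..., TS then stand in for the true properties of the estimator, exactly as described above.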
Simulation for properties of estimators
A simple example:
• Compare three estimators for the mean µ of a distribution based on
a random sample Y1, Y2, · · · , Yn
• The three estimators are:
– Sample mean, T^(1)
– Sample median, T^(2)
– Sample 10% trimmed mean, T^(3)
Simulation procedures
For a particular choice of µ, n and true underlying distribution
• Generate independent draws Y1, Y2, · · · , Yn from the distribution
• Compute T^(1), T^(2), T^(3)
• Repeat S times. Hence we have
T^(1)_1, T^(1)_2, · · · , T^(1)_S
T^(2)_1, T^(2)_2, · · · , T^(2)_S
T^(3)_1, T^(3)_2, · · · , T^(3)_S
Simulation procedures
Compute for k = 1, 2, 3
$$\widehat{\mathrm{mean}} = \frac{1}{S}\sum_{s=1}^{S} T^{(k)}_s = \bar{T}^{(k)}$$

$$\widehat{\mathrm{bias}} = \bar{T}^{(k)} - \mu$$

$$\widehat{\mathrm{SD}} = \sqrt{\frac{1}{S-1}\sum_{s=1}^{S}\big(T^{(k)}_s - \bar{T}^{(k)}\big)^2}$$

$$\widehat{\mathrm{MSE}} = \frac{1}{S}\sum_{s=1}^{S}\big(T^{(k)}_s - \mu\big)^2 \approx \widehat{\mathrm{SD}}^2 + \widehat{\mathrm{bias}}^2$$
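The approximation in the last formula is just the bias–variance decomposition of the MSE; a short derivation (expectations taken over the sampling distribution of T^(k)):

```latex
\begin{aligned}
\mathrm{MSE}\big(T^{(k)}\big)
  &= E\big[(T^{(k)}-\mu)^2\big] \\
  &= E\big[(T^{(k)}-E[T^{(k)}])^2\big] + \big(E[T^{(k)}]-\mu\big)^2 \\
  &= \mathrm{Var}\big(T^{(k)}\big) + \mathrm{bias}^2\big(T^{(k)}\big),
\end{aligned}
```

since the cross term vanishes. Replacing the variance and bias by their Monte Carlo estimates gives the stated approximation; it is "≈" rather than "=" only because the SD estimate uses the divisor S − 1 while the MSE estimate divides by S.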
A simulation study
• To compare 3 estimators for location µ through a simulation study
• The 3 estimators are
– Sample Mean,
– Sample Median,
– Sample 10% trimmed mean
• Underlying distribution: N(1, 1) (the code below uses µ = 1, σ = 1)
• Sample size: 15
• Simulation size: 1000
A simulation study: R code
> # To compare 3 estimators for location, mu, through a simulation study
> # The 3 estimators are the sample mean, median, and 10% trimmed mean.
> S=1000 # Simulation size (no. of samples)
> n=15 # Sample size
> mu=1 # Mean of the underlying normal distribution
> sd=1 # Standard deviation of the underlying normal distribution
A simulation study: R code
> meax=numeric(S) # Vector to hold all the sample means
> medx=numeric(S) # Vector to hold all the sample medians
> trmx=numeric(S) # Vector to hold all the sample 10% trimmed means
> stdx=numeric(S) # Vector to hold all the sample standard deviations
> set.seed(1234) # Set the random seed
A simulation study: R code
> for (i in 1:S){
+ x=rnorm(n,mu,sd) # Generate a random sample of size n
+ meax[i]=mean(x) # Compute the mean for the i-th sample, i.e. T^(1)_i
+ medx[i]=median(x) # Compute the median for the i-th sample, i.e. T^(2)_i
+ trmx[i]=mean(x,trim=0.1) # Compute the 10% trimmed mean for the i-th sample, i.e. T^(3)_i
+ stdx[i]=sd(x) # Compute the standard deviation for the i-th sample
+ }
A simulation study: R code
> # Compute the mean of the "S" sample means, medians, and trimmed means
> simumean=apply(cbind(meax,medx,trmx),2,mean)
> # The "2" in the second argument asks R to find "means" for
> # all the columns of the matrix given in argument 1.
> # Hence "simumean" is a 3x1 vector consisting of the averages of the
> # "S" means, medians, and 10% trimmed means.
A simulation study: R code
> # Compute the standard deviation of the "S" sample means,
> # medians, and 10% trimmed means
> simustd=apply(cbind(meax,medx,trmx),2,sd)
> # Compute the bias
> simubias=simumean-rep(mu,3)
> # Compute the MSE (Mean Squared Error)
> simumse=simubias^2+simustd^2
A simulation study: R code
> ests=c("Sample mean","Sample median","Sample 10% trimmed mean")
> # column headings in the output
> names=c("True value","No. of simu","MC Mean","MC Std Deviation","MC Bias","MC MSE")
> # row headings in the output
> sumdat=rbind(rep(mu,3),rep(S,3),simumean,simustd,simubias,simumse)
> dimnames(sumdat)=list(names,ests)
> round(sumdat,4)
A simulation study: R output
[The slide showed the summary table produced by round(sumdat,4).]
Checking coverage probability of confidence interval
• Usual 95% confidence interval for µ based on sample mean is given
by
$$\left(\bar{Y} - t_{0.025,n-1}\frac{s}{\sqrt{n}},\;\; \bar{Y} + t_{0.025,n-1}\frac{s}{\sqrt{n}}\right)$$
• Does the interval achieve the nominal level of coverage 95%?
Checking coverage probability of confidence interval
> #Check the coverage probability of confidence interval
> #We make use of the data obtained from the above simulation study
> t05=qt(0.975,n-1) # Get t_{0.025,14}
> coverage=sum((meax-t05*stdx/sqrt(n)<=mu)&(meax+t05*stdx/sqrt(n)>=mu))/S
> # The above statement is equivalent to
> # d=0
> # for(i in 1:S){
> #   d=d+((meax[i]-t05*stdx[i]/sqrt(n)<=mu)&(meax[i]+t05*stdx[i]/sqrt(n)>=mu))
> # }
> # coverage=d/S
> coverage
[1] 0.945
Simulation for properties of hypothesis tests
Example: Consider the size and power of the usual t-test for the mean
• Test H0 : µ = µ0 against H1 : µ ≠ µ0
• To evaluate whether the size/level of the test achieves the advertised α:
– Generate data under H0 : µ = µ0 and calculate the proportion of
rejections of H0
– This approximates the true probability of rejecting H0 when it is true
– The proportion should be approximately equal to α
Simulation for properties of hypothesis tests
• To evaluate power:
– Generate data under some alternative µ ≠ µ0 (say, µ = µ1)
and calculate the proportion of rejections of H0
– This approximates the true probability of rejecting H0 when the
alternative is true
Checking the size of t-test: R code
> # Check the size of the t-test
> # The data were generated with mean mu, so under H0 we take mu0 = mu
> mu0=mu
> ttests=(meax-mu0)/(stdx/sqrt(n))
> size=sum(abs(ttests)>t05)/S
> size
[1] 0.045
Checking the power of t-test: R code
> # Check the power of the t-test when the alternative is 1*sd away from mu0
> # The samples were generated from N(mu1, sigma) with mu1 = 1, sigma = 1
> # Test H0: mu = mu0 vs H1: mu = mu1 = mu0 + sigma, where mu0 = 0
> mu0=0
> ttests=(meax-mu0)/(stdx/sqrt(n))
> power=sum(abs(ttests)>t05)/S
> power
[1] 0.954
Simulation study in SAS
Simulation to study the size of t-test
*Generate “ns” normal random samples;
* “mcrep” (M.C. Replications) is the index for the s-th random sample;
data simu;
  seed=123;
  ns=1000; n=25; mu=0; sigma=1;
  do mcrep=1 to ns;
    do i=1 to n;
      x=mu+sigma*rannor(seed);
      output;
    end;
  end;
  keep mcrep x;
run;
Simulation study in SAS
proc sort data=simu;
by mcrep;
run;
* Perform the t-test for each sample and
  output the p-value "probt" to the SAS dataset "outtest";
proc univariate data=simu noprint mu0=0;
  by mcrep;
  var x;
  output out=outtest probt=p;
run;
Simulation study in SAS
* Count how many samples have "probt"<0.05;
data outtest;
  set outtest;
  reject=(p<0.05);
run;
* "reject" flags the samples with "probt"<0.05, so the mean of
  "reject" is the rejection rate "rejrate";
Simulation study in SAS
proc means data=outtest noprint;
  var reject;
  output out=results mean=rejrate;
run;
proc print data=results;
  var _freq_ rejrate;
run;