Chapter 11
Simulation studies in statistics
Introduction
• What is a simulation study, and why do one?
• Simulations for properties of estimators
• Simulations for properties of hypothesis tests
What is a simulation study?
Simulation
• A numerical technique for conducting experiments on the computer.
• It involves random sampling from probability distributions.
Rationale (in statistics)
• Properties of statistical methods must be established so that the
methods may be used with confidence.
• Exact analytical derivations of properties are rarely possible.
• Large sample approximations to properties are often possible.
• But what happens in the finite sample size case?
• And what happens when assumptions are violated?
Usual issues
• Is an estimator biased in finite samples? Is it still consistent under
departures from assumptions? What is its sampling variance?
• How does it compare to competing estimators on the basis of bias,
precision, etc.?
• Does a procedure for constructing a confidence interval for a
parameter achieve the advertised nominal level of coverage?
Usual issues
• Does a hypothesis testing procedure attain the advertised level of
significance or size?
• If it does, what power is possible against different alternatives to
the null hypothesis? Do different test procedures deliver different
power?
How to answer these questions in the absence of analytical results?
Monte Carlo simulation approximation
• An estimator or test statistic has a true sampling distribution
under a particular set of conditions (finite sample size, true
distribution of the data, etc.)
• Ideally, we would want to know this true sampling distribution in
order to address the issues on the previous slide.
• But deriving the true sampling distribution is often not tractable.
• Hence we approximate the sampling distribution of an estimator or
a test statistic under a particular set of conditions.
How to approximate
A typical Monte Carlo simulation involves the following:
• Generate S independent data sets under the conditions of interest
• Compute the numerical value of the estimator/test statistic T from
each data set. Hence we have T1, T2, · · · , Ts, · · · , TS
How to approximate
• If S is large enough, summary statistics across T1, T2, · · · , Ts, · · · , TS
should be good approximations to the true properties of the
estimator/test statistic under the conditions of interest.
An example:
Consider an estimator for a parameter θ:
– Ts is the value of T from the s-th data set, s = 1, 2, · · · , S.
– The mean of all the Ts's is an estimate of the true mean of the
sampling distribution of the estimator.
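The generic recipe above can be sketched in a few lines of R. This is a minimal illustration only; the distribution (N(0,1)), sample size (n = 15), and estimator (the sample mean) are placeholder choices, not the ones used in the study that follows.

```r
# Generic Monte Carlo recipe: approximate the sampling distribution
# of an estimator T under chosen conditions.
set.seed(1)       # for reproducibility
S <- 1000         # number of simulated data sets
n <- 15           # sample size for each data set (illustrative choice)
Ts <- numeric(S)  # storage for T1, T2, ..., TS

for (s in 1:S) {
  y <- rnorm(n, mean = 0, sd = 1)  # generate one data set (illustrative: N(0,1))
  Ts[s] <- mean(y)                 # compute the estimator (illustrative: sample mean)
}

mean(Ts)  # approximates the true mean of T's sampling distribution
sd(Ts)    # approximates its true standard deviation (here about 1/sqrt(15))
```

Summary statistics of the stored values T1, ..., TS then stand in for the true properties of the estimator, exactly as described above.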
Simulation for properties of estimators
A simple example:
• Compare three estimators for the mean µ of a distribution based on
a random sample Y1, Y2, · · · , Yn
• The three estimators are:
– Sample mean, T^(1)
– Sample median, T^(2)
– Sample 10% trimmed mean, T^(3)
Simulation procedures
For a particular choice of µ, n and true underlying distribution
• Generate independent draws Y1, Y2, · · · , Yn from the distribution
• Compute T^(1), T^(2), T^(3)
• Repeat S times. Hence we have
T^(1)_1, T^(1)_2, · · · , T^(1)_S
T^(2)_1, T^(2)_2, · · · , T^(2)_S
T^(3)_1, T^(3)_2, · · · , T^(3)_S
Simulation procedures
Compute for k = 1, 2, 3
$$\widehat{\mathrm{mean}} = \frac{1}{S}\sum_{s=1}^{S} T^{(k)}_s = \bar{T}^{(k)}$$

$$\widehat{\mathrm{bias}} = \bar{T}^{(k)} - \mu$$

$$\widehat{\mathrm{SD}} = \sqrt{\frac{1}{S-1}\sum_{s=1}^{S}\big(T^{(k)}_s - \bar{T}^{(k)}\big)^2}$$

$$\widehat{\mathrm{MSE}} = \frac{1}{S}\sum_{s=1}^{S}\big(T^{(k)}_s - \mu\big)^2 \approx \widehat{\mathrm{SD}}^2 + \widehat{\mathrm{bias}}^2$$
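The approximation in the last formula is just the bias–variance decomposition of the MSE; a short derivation (expectations taken over the sampling distribution of T^(k)):

```latex
\begin{aligned}
\mathrm{MSE}\big(T^{(k)}\big)
  &= E\big[(T^{(k)}-\mu)^2\big] \\
  &= E\big[(T^{(k)}-E[T^{(k)}])^2\big] + \big(E[T^{(k)}]-\mu\big)^2 \\
  &= \mathrm{Var}\big(T^{(k)}\big) + \mathrm{bias}^2\big(T^{(k)}\big),
\end{aligned}
```

since the cross term vanishes. Replacing the variance and bias by their Monte Carlo estimates gives the stated approximation; it is "≈" rather than "=" only because the SD estimate uses the divisor S − 1 while the MSE estimate divides by S.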
A simulation study
• To compare 3 estimators for location µ through a simulation study
• The 3 estimators are
– Sample Mean,
– Sample Median,
– Sample 10% trimmed mean
• Underlying distribution: N(1, 1) (the code below uses µ = 1, σ = 1)
• Sample size: 15
• Simulation size: 1000
A simulation study: R code
> # To compare 3 estimators for location, mu, through a simulation study
> # The 3 estimators are the sample mean, median, and 10% trimmed mean.
> S=1000 # Simulation size (no. of samples)
> n=15 # Sample size
> mu=1 # Mean of the underlying normal distribution
> sd=1 # Standard deviation of the underlying normal distribution
A simulation study: R code
> meax=numeric(S) # Vector to hold all the sample means
> medx=numeric(S) # Vector to hold all the sample medians
> trmx=numeric(S) # Vector to hold all the sample 10% trimmed means
> stdx=numeric(S) # Vector to hold all the sample standard deviations
> set.seed(1234) # Set the random seed
A simulation study: R code
> for (i in 1:S){
+ x=rnorm(n,mu,sd) # Generate a random sample of size n
+ meax[i]=mean(x) # Compute the mean for the i-th sample, i.e. T^(1)_i
+ medx[i]=median(x) # Compute the median for the i-th sample, i.e. T^(2)_i
+ trmx[i]=mean(x,trim=0.1) # Compute the 10% trimmed mean for the i-th sample, i.e. T^(3)_i
+ stdx[i]=sd(x) # Compute the standard deviation for the i-th sample
+ }
A simulation study: R code
> # Compute the mean of the "S" sample means, medians, and trimmed means
> simumean=apply(cbind(meax,medx,trmx),2,mean)
> # The "2" in the second argument asks R to find "means" for
> # all the columns of the matrix given in argument 1.
> # Hence "simumean" is a 3x1 vector consisting of the averages of the
> # "S" means, medians, and 10% trimmed means.
A simulation study: R code
> # Compute the standard deviation of the "S" sample means,
> # medians, and 10% trimmed means
> simustd=apply(cbind(meax,medx,trmx),2,sd)
> # Compute the bias
> simubias=simumean-rep(mu,3)
> # Compute the MSE (Mean Squared Error)
> simumse=simubias^2+simustd^2
A simulation study: R code
> ests=c("Sample mean","Sample median","Sample 10% trimmed mean")
> # column headings in the output
> names=c("True value","No. of simu","MC Mean","MC Std Deviation","MC Bias","MC MSE")
> # row headings in the output
> sumdat=rbind(rep(mu,3),rep(S,3),simumean,simustd,simubias,simumse)
> dimnames(sumdat)=list(names,ests)
> round(sumdat,4)
A simulation study: R output
[The slide showed the summary table produced by round(sumdat,4).]
Checking coverage probability of confidence interval
• Usual 95% confidence interval for µ based on sample mean is given
by
$$\left(\bar{Y} - t_{0.025,n-1}\frac{s}{\sqrt{n}},\;\; \bar{Y} + t_{0.025,n-1}\frac{s}{\sqrt{n}}\right)$$
• Does the interval achieve the nominal level of coverage 95%?
Checking coverage probability of confidence interval
> #Check the coverage probability of confidence interval
> #We make use of the data obtained from the above simulation study
> t05=qt(0.975,n-1) # Get t_{0.025,14}
> coverage=sum((meax-t05*stdx/sqrt(n)<=mu)&(meax+t05*stdx/sqrt(n)>=mu))/S
> # The above statement is equivalent to
> # d=0
> # for(i in 1:S){
> #   d=d+((meax[i]-t05*stdx[i]/sqrt(n)<=mu)&(meax[i]+t05*stdx[i]/sqrt(n)>=mu))
> # }
> # coverage=d/S
> coverage
[1] 0.945
Simulation for properties of hypothesis tests
Example: Consider the size and power of the usual t-test for the mean
• Test H0 : µ = µ0 against H1 : µ ≠ µ0
• To evaluate whether the size/level of the test achieves the advertised α:
– Generate data under H0 : µ = µ0 and calculate the proportion of
rejections of H0
– This approximates the true probability of rejecting H0 when it is true
– The proportion should be approximately equal to α
Simulation for properties of hypothesis tests
• To evaluate power:
– Generate data under some alternative µ ≠ µ0 (say, µ = µ1)
and calculate the proportion of rejections of H0
– This approximates the true probability of rejecting H0 when the
alternative is true
Checking the size of t-test: R code
> # Check the size of the t-test
> # The data were generated with mean mu, so under H0 we take mu0 = mu
> mu0=mu
> ttests=(meax-mu0)/(stdx/sqrt(n))
> size=sum(abs(ttests)>t05)/S
> size
[1] 0.045
Checking the power of t-test: R code
> # Check the power of the t-test when the alternative is 1*sd away from mu0
> # The samples were generated from N(mu1, sigma) with mu1 = 1, sigma = 1
> # Test H0: mu = mu0 vs H1: mu = mu1 = mu0 + sigma, where mu0 = 0
> mu0=0
> ttests=(meax-mu0)/(stdx/sqrt(n))
> power=sum(abs(ttests)>t05)/S
> power
[1] 0.954
Simulation study in SAS
Simulation to study the size of t-test
*Generate “ns” normal random samples;
* “mcrep” (M.C. Replications) is the index for the s-th random sample;
data simu;
  seed=123;
  ns=1000; n=25; mu=0; sigma=1;
  do mcrep=1 to ns;
    do i=1 to n;
      x=mu+sigma*rannor(seed);
      output;
    end;
  end;
  keep mcrep x;
run;
Simulation study in SAS
proc sort data=simu;
by mcrep;
run;
* Perform the t-test for each sample and
  output the p-value "probt" to the SAS dataset "outtest";
proc univariate data=simu noprint mu0=0;
  by mcrep;
  var x;
  output out=outtest probt=p;
run;
Simulation study in SAS
* Count how many samples have "probt"<0.05;
data outtest;
  set outtest;
  reject=(p<0.05);
run;
* "reject" flags the samples with "probt"<0.05, so the mean of
  "reject" is the rejection rate "rejrate";
Simulation study in SAS
proc means data=outtest noprint;
  var reject;
  output out=results mean=rejrate;
run;
proc print data=results;
  var _freq_ rejrate;
run;