Statistics Lab. Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy. Lesson 3 - Point Estimate, Confidence Interval and Hypothesis Tests - 16.01.2014

Point Estimate, Confidence Interval, Hypothesis tests

Page 2: Point Estimate, Confidence Interval, Hypothesis tests

Introduction

Let's start with empirical data (one variable of length N) read from an external file, and suppose we treat it as the population. We then draw a sample of size n.

Suppose we have no information on the population (or, better, we want to check whether and how well the sample represents the population).

In other words, we want to make inference using the information contained in the sample, in order to obtain an estimate for the population.

That sample is one of several samples we could randomly draw from the population (the sample space).

What are the instruments for obtaining information about the population? (1) Sample mean (point estimation) (2) Confidence interval (3) Hypothesis tests

Page 3: Point Estimate, Confidence Interval, Hypothesis tests

Sample space

In probability theory, the sample space of an experiment or random trial is the set of all possible outcomes or results of that experiment.

It is common to refer to a sample space by the labels S, Ω, or U.

For example, for tossing two coins, the corresponding sample space is {(head,head), (head,tail), (tail,head), (tail,tail)}, so its size is 4: |Ω| = 4. This means that we can obtain 4 different samples, each with its corresponding sample mean.

In practice, we deal with only one sample, taken at random from the sample space.
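As a small illustration, the two-coin sample space can be enumerated in R. Coding head as 1 and tail as 0 is an assumption made here only so that a sample mean can be computed for each outcome.

```r
# Enumerate the sample space for tossing two coins.
# Assumption for illustration: head = 1, tail = 0.
omega <- expand.grid(first = c(1, 0), second = c(1, 0))

# Number of possible samples (size of the sample space)
n_samples <- nrow(omega)

# Sample mean attached to each possible sample
omega$mean <- rowMeans(omega[, c("first", "second")])

print(omega)
print(n_samples)
```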

Page 4: Point Estimate, Confidence Interval, Hypothesis tests

Point estimate

A point estimate lets us summarize the information contained in the population (of size N) through a single value computed from n values.

The most widely used unbiased point estimator is the sample mean:

$$\bar{X}_n = \frac{\sum_{i=1}^{n} x_i}{n}$$

Other point estimators are: (1) Sample median (2) Sample mode (3) Geometric mean.

$$M_g = \sqrt[n]{\prod_{i=1}^{n} x_i} = \exp\left[\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right]$$

An example of what is not an estimator: the sample mean computed after subsetting the sample by truncating it at a certain value.

P.S. A naive definition of an estimator: a quantity computed using all n observations in the sample.
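The point estimators listed on this slide can be computed directly in R. The following is a minimal sketch on a hypothetical sample (the values are made up for illustration); it also checks that the two forms of the geometric mean agree.

```r
# Hypothetical sample of n = 5 positive values (illustration only)
x <- c(2, 4, 4, 8, 16)
n <- length(x)

# Sample mean: sum of the values divided by n
x_bar <- sum(x) / n

# Geometric mean, computed in the two equivalent ways
mg_root <- prod(x)^(1 / n)        # n-th root of the product
mg_log  <- exp(mean(log(x)))      # exponential of the mean log

# Sample median
x_med <- median(x)

print(c(mean = x_bar, geometric = mg_root, median = x_med))
```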

Page 5: Point Estimate, Confidence Interval, Hypothesis tests

Efficient estimators

The BLUE (Best Linear Unbiased Estimator) is defined as follows:

1. it is a linear function of all the sample values

2. it is unbiased ($E(\bar{X}_n) = \theta$)

3. it has the smallest variance among all linear unbiased estimators.

The sample mean is the BLUE for the parameter µ.

Some estimators are biased but consistent. An estimator is consistent when it converges in probability to the true parameter value as n → ∞; for a biased but consistent estimator the bias typically vanishes as n grows.
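A standard example of a biased but consistent estimator, added here as an illustration and not taken from the slides, is the variance estimator that divides by n instead of n − 1. Assuming independent observations with common variance σ²:

```latex
\hat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{X}_n\right)^2,
\qquad
E\left(\hat{\sigma}^2_n\right) = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2,
\qquad
\lim_{n\to\infty} E\left(\hat{\sigma}^2_n\right) = \sigma^2 .
```

The bias equals −σ²/n, so it shrinks to zero as the sample size grows.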

Page 6: Point Estimate, Confidence Interval, Hypothesis tests

Point estimators - cases

- Normal samples: $\bar{X}_n$ is the BLUE for the parameter µ (mean)

- Bernoulli samples, $f(x) = \rho^x (1-\rho)^{1-x}$: $\bar{X}_n$ is an unbiased estimator for the parameter ρ (frequency)

- Poisson samples, $f(x) = \frac{e^{-k} k^x}{x!}$: $\bar{X}_n$ is an unbiased estimator for the parameter k (which represents both the mean and the variance of the distribution)

- Exponential samples, $f(x) = \lambda e^{-\lambda x}$: $1/\bar{X}_n$ is a consistent estimator for the parameter λ (the density at value 0); it is biased in small samples, since $E(1/\bar{X}_n) = n\lambda/(n-1)$

Page 7: Point Estimate, Confidence Interval, Hypothesis tests

Confidence interval theory

With point estimators we use only one value to make inference about the population.

With a confidence interval we define a minimum and a maximum value between which we expect the population parameter to lie. Formally, we need to calculate:

$$\mu_1 = \bar{X}_n - z \cdot \frac{\sigma}{\sqrt{n}}$$

$$\mu_2 = \bar{X}_n + z \cdot \frac{\sigma}{\sqrt{n}}$$

and we end up with the interval $[\mu_1, \mu_2]$ for µ.

Here: $\bar{X}_n$ is the sample mean; z is the upper (or lower) critical value of the theoretical distribution; σ is the standard deviation of the theoretical distribution; n is the sample size.

(See the graph)

Page 8: Point Estimate, Confidence Interval, Hypothesis tests

Confidence interval theory - Gaussian

We will make some assumptions about what we might find in an experiment and find the resulting confidence interval using a normal distribution.

Let us assume that the sample mean is 5, the standard deviation in the population is known and equal to 2, and the sample size is n = 20. In the example below we use a 95 per cent confidence level and wish to find the confidence interval.

N.B. Here, since the confidence level is 95 per cent, the critical value z to consider is the quantile at which the CDF (pnorm in R) equals 0.975, i.e. qnorm(0.975).

We can equivalently speak of α = 0.05, or 1 − α = 0.95, or 1 − α/2 = 0.975.
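The example described on this slide (sample mean 5, known population sd 2, n = 20, 95 per cent confidence) can be sketched in R as follows. The variable names are chosen here for illustration.

```r
# Gaussian confidence interval with known population sd
x_bar <- 5      # sample mean
sigma <- 2      # known population standard deviation
n     <- 20     # sample size
alpha <- 0.05   # 1 - alpha = 0.95 confidence level

# Critical value: the quantile whose CDF equals 1 - alpha/2 = 0.975
z <- qnorm(1 - alpha / 2)

# Margin of error and interval bounds
margin <- z * sigma / sqrt(n)
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
print(ci)
```

The interval comes out at roughly 4.12 to 5.88.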

Page 9: Point Estimate, Confidence Interval, Hypothesis tests

Confidence interval theory - T-student

We use the Student t distribution when n is small and the sd is unknown in the population. We need to use an estimate based on the sample variance:

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X}_n)^2}{n-1}}$$

The Student t distribution is more spread out than the normal distribution.

In simple words, since we do not know the population sd, we need wider intervals (a cautious approach).

The only difference from the normal case is that we use the command associated with the t distribution rather than the normal distribution. Here we repeat the procedure above, but we assume that we are working with a sample standard deviation rather than an exact standard deviation.

N.B. The t distribution is characterized by its degrees of freedom. In this test the degrees of freedom are equal to n − 1, because we use 1 estimate (1 constraint).
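Repeating the earlier numeric example (mean 5, sd 2, n = 20) while treating 2 as a sample standard deviation gives the following sketch. It uses qt in place of qnorm; the variable names are illustrative.

```r
# Student t confidence interval with estimated (sample) sd
x_bar <- 5      # sample mean
s     <- 2      # sample standard deviation (estimate of sigma)
n     <- 20     # sample size
alpha <- 0.05

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Normal critical value, kept only for comparison
z_crit <- qnorm(1 - alpha / 2)

margin <- t_crit * s / sqrt(n)
ci_t <- c(lower = x_bar - margin, upper = x_bar + margin)
print(c(t = t_crit, z = z_crit))
print(ci_t)
```

Because the t critical value exceeds the normal one, the interval is wider than in the Gaussian case: roughly 4.06 to 5.94.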

Page 10: Point Estimate, Confidence Interval, Hypothesis tests

Confidence interval theory - comparison of two means

In some cases we have an experiment of the (for example) case-control type.

Let's imagine the population is split in 2: one part is the treated group, the other is the non-treated group.

Suppose we extract a sample from each group, with the aim of testing whether the two samples come from populations with the same mean parameter (is the treatment effective?).

The output of this test will be a confidence interval for the difference between the two means.

N.B. Here, the degrees of freedom of the t distribution are equal to min(n1, n2) − 1.

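Page 11: Point Estimate, Confidence Interval, Hypothesis tests

Formulas

- Gaussian confidence interval:

$$[\mu_1, \mu_2] = \bar{X}_n \pm z \cdot \frac{\sigma}{\sqrt{n}}$$

- T-student confidence interval:

$$[\mu_1, \mu_2] = \bar{X}_n \pm t_{n-1} \cdot \frac{\hat{\sigma}}{\sqrt{n}}$$

- T-student confidence interval for the difference between two samples:

$$[\mu_{diff,1}, \mu_{diff,2}] = (\bar{X}_1 - \bar{X}_2) \pm t_{df} \cdot sd, \qquad df = \min(n_1, n_2) - 1$$

where $sd = \sqrt{\frac{sd_1^2}{n_1} + \frac{sd_2^2}{n_2}}$

- Gaussian confidence interval for a proportion (Bernoulli distribution):

$$[\rho_1, \rho_2] = f_1 \pm z \cdot sd$$

where $sd = \sqrt{\frac{\rho(1-\rho)}{n}}$, with ρ estimated by the sample frequency $f_1$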
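The two-sample interval can be sketched in R as below. The summary statistics are hypothetical values chosen for illustration, and the standard error is taken as the square root of sd1²/n1 + sd2²/n2, with the conservative min(n1, n2) − 1 degrees of freedom.

```r
# Hypothetical summary statistics for two groups (illustration only)
mean1 <- 10;   sd1 <- 3;   n1 <- 300   # treated group
mean2 <- 10.5; sd2 <- 2.5; n2 <- 230   # non-treated group
alpha <- 0.05

# Standard error of the difference between the two sample means
se_diff <- sqrt(sd1^2 / n1 + sd2^2 / n2)

# Conservative degrees of freedom: min(n1, n2) - 1
df <- min(n1, n2) - 1
t_crit <- qt(1 - alpha / 2, df = df)

# Interval for the difference between the two means
d_hat <- mean1 - mean2
ci_diff <- c(lower = d_hat - t_crit * se_diff,
             upper = d_hat + t_crit * se_diff)
print(ci_diff)
```

If the resulting interval excludes 0, the two sample means differ at the chosen confidence level.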

Page 12: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing

Researchers retain or reject hypotheses based on measurements of observed samples.

The decision is often based on a statistical mechanism called hypothesis testing.

A type I error is the mishap of falsely rejecting a null hypothesis when the null hypothesis is true.

The probability of committing a type I error is called the significance level of the hypothesis test, and is denoted by the Greek letter α (the same used in the confidence intervals).

We demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach.

Then we discuss the popular (and very quick) p-value approach as an alternative.

Page 13: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing - lower tail

The null hypothesis of the lower tail test of the population mean can be expressed as follows:

µ ≥ µ0; where µ0 is a hypothesized lower bound of the true population mean µ.

Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ:

$$z = \frac{\bar{X}_n - \mu_0}{\sigma / \sqrt{n}}$$

Then the null hypothesis of the lower tail test is to be rejected if $z \le z_{\alpha}$, where $z_{\alpha}$ is the 100(α) percentile of the standard normal distribution.

Page 14: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing - upper tail

The null hypothesis of the upper tail test of the population mean can be expressed as follows:

µ ≤ µ0; where µ0 is a hypothesized upper bound of the true population mean µ.

Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ:

$$z = \frac{\bar{X}_n - \mu_0}{\sigma / \sqrt{n}}$$

Then the null hypothesis of the upper tail test is to be rejected if $z \ge z_{1-\alpha}$, where $z_{1-\alpha}$ is the 100(1 − α) percentile of the standard normal distribution.

Page 15: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing - two tailed

The null hypothesis of the two-tailed test of the population mean can be expressed as follows:

µ = µ0; where µ0 is a hypothesized value of the true population mean µ.

Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ:

$$z = \frac{\bar{X}_n - \mu_0}{\sigma / \sqrt{n}}$$

Then the null hypothesis of the two-tailed test is to be rejected if $z \le z_{\alpha/2}$ or $z \ge z_{1-\alpha/2}$, where $z_{\alpha/2}$ is the 100(α/2) percentile of the standard normal distribution.
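The three rejection rules for the mean with known σ (lower tail, upper tail, two-tailed) can be gathered in one small R helper. This is only a sketch: `z_test` and its arguments are hypothetical names introduced here, and the data in the usage line are made up. It returns both the critical value decision and the corresponding p-value.

```r
# z test for a population mean with known sigma.
# `z_test` is a hypothetical helper written for this sketch.
z_test <- function(x_bar, mu0, sigma, n, alpha = 0.05,
                   tail = c("lower", "upper", "two")) {
  tail <- match.arg(tail)
  # Test statistic
  z <- (x_bar - mu0) / (sigma / sqrt(n))
  # Critical value approach
  reject <- switch(tail,
    lower = z <= qnorm(alpha),
    upper = z >= qnorm(1 - alpha),
    two   = z <= qnorm(alpha / 2) || z >= qnorm(1 - alpha / 2))
  # p-value approach: reject when p_value <= alpha
  p_value <- switch(tail,
    lower = pnorm(z),
    upper = pnorm(z, lower.tail = FALSE),
    two   = 2 * pnorm(abs(z), lower.tail = FALSE))
  list(z = z, reject = reject, p_value = p_value)
}

# Hypothetical data: H0 is mu >= 10000; sample mean 9900, sigma 120, n = 30
res <- z_test(x_bar = 9900, mu0 = 10000, sigma = 120, n = 30, tail = "lower")
print(res)
```

With these made-up numbers the statistic is about −4.56, well below the lower critical value, so the null hypothesis is rejected under both approaches.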

Page 16: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing - lower tail with unknown variance

The null hypothesis of the lower tail test of the population mean can be expressed as follows:

µ ≥ µ0; where µ0 is a hypothesized lower bound of the true population mean µ.

Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation $\hat{\sigma}$:

$$t = \frac{\bar{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$$

Then the null hypothesis of the lower tail test is to be rejected if $t \le t_{\alpha}$, where $t_{\alpha}$ is the 100(α) percentile of the Student t distribution with n − 1 degrees of freedom.

Page 17: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing - upper tail with unknown variance

The null hypothesis of the upper tail test of the population mean can be expressed as follows:

µ ≤ µ0; where µ0 is a hypothesized upper bound of the true population mean µ.

Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation $\hat{\sigma}$:

$$t = \frac{\bar{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$$

Then the null hypothesis of the upper tail test is to be rejected if $t \ge t_{1-\alpha}$, where $t_{1-\alpha}$ is the 100(1 − α) percentile of the Student t distribution with n − 1 degrees of freedom.

Page 18: Point Estimate, Confidence Interval, Hypothesis tests

Hypothesis testing - two tailed with unknown variance

The null hypothesis of the two-tailed test of the population mean can be expressed as follows:

µ = µ0; where µ0 is a hypothesized value of the true population mean µ.

Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation $\hat{\sigma}$:

$$t = \frac{\bar{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$$

Then the null hypothesis of the two-tailed test is to be rejected if $t \le t_{\alpha/2}$ or $t \ge t_{1-\alpha/2}$, where $t_{\alpha/2}$ is the 100(α/2) percentile of the Student t distribution with n − 1 degrees of freedom.
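When the variance is unknown, the same tests can be run on raw data with the sample standard deviation. The sketch below uses a hypothetical sample (values made up for illustration), computes the statistic by hand, applies the lower tail and two-tailed rules, and compares the statistic with the one reported by R's built-in t.test.

```r
# Hypothetical sample (illustration only)
x <- c(9.8, 10.2, 9.5, 9.9, 10.1, 9.7, 9.6, 10.0)
mu0   <- 10
alpha <- 0.05
n     <- length(x)

# Test statistic built from the sample mean and the sample sd
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Lower tail rule: reject H0 (mu >= mu0) if t <= t_alpha, n - 1 df
t_crit <- qt(alpha, df = n - 1)
reject_lower <- t_stat <= t_crit

# Two-tailed rule: reject H0 (mu = mu0) if |t| >= t_{1 - alpha/2}
reject_two <- abs(t_stat) >= qt(1 - alpha / 2, df = n - 1)

# The built-in t.test reports the same statistic and degrees of freedom
check <- t.test(x, mu = mu0, alternative = "less")
print(c(t = t_stat, critical = t_crit))
print(c(lower = reject_lower, two = reject_two))
```

The two-tailed check uses the absolute value of t, which is equivalent to the pair of conditions on the slide because the t distribution is symmetric around zero.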

Page 19: Point Estimate, Confidence Interval, Hypothesis tests

Lower Tail Test of Population Proportion

The null hypothesis of the lower tail test about the population proportion can be expressed as follows:

ρ ≥ ρ0; where ρ0 is a hypothesized lower bound of the true population proportion ρ.

Let us define the test statistic z in terms of the sample proportion $\hat{\rho}$ and the sample size n:

$$z = \frac{\hat{\rho} - \rho_0}{\sqrt{\frac{\rho_0 (1 - \rho_0)}{n}}}$$

Then the null hypothesis of the lower tail test is to be rejected if $z \le z_{\alpha}$, where $z_{\alpha}$ is the 100(α) percentile of the standard normal distribution.

Page 20: Point Estimate, Confidence Interval, Hypothesis tests

Upper Tail Test of Population Proportion

The null hypothesis of the upper tail test about the population proportion can be expressed as follows:

ρ ≤ ρ0; where ρ0 is a hypothesized upper bound of the true population proportion ρ.

Let us define the test statistic z in terms of the sample proportion $\hat{\rho}$ and the sample size n:

$$z = \frac{\hat{\rho} - \rho_0}{\sqrt{\frac{\rho_0 (1 - \rho_0)}{n}}}$$

Then the null hypothesis of the upper tail test is to be rejected if $z \ge z_{1-\alpha}$, where $z_{1-\alpha}$ is the 100(1 − α) percentile of the standard normal distribution.

Page 21: Point Estimate, Confidence Interval, Hypothesis tests

Two Tailed Test of Population Proportion

The null hypothesis of the two-tailed test about the population proportion can be expressed as follows:

ρ = ρ0; where ρ0 is a hypothesized value of the true population proportion ρ.

Let us define the test statistic z in terms of the sample proportion $\hat{\rho}$ and the sample size n:

$$z = \frac{\hat{\rho} - \rho_0}{\sqrt{\frac{\rho_0 (1 - \rho_0)}{n}}}$$

Then the null hypothesis of the two-tailed test is to be rejected if $z \le z_{\alpha/2}$ or $z \ge z_{1-\alpha/2}$.
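The proportion tests follow the same pattern as the tests on the mean. A minimal R sketch, using hypothetical counts (85 successes out of 148 trials, with ρ0 = 0.6, all made up for illustration):

```r
# Hypothetical data (illustration only)
successes <- 85
n         <- 148
rho0      <- 0.6    # hypothesized proportion
alpha     <- 0.05

# Sample proportion and test statistic
rho_hat <- successes / n
z <- (rho_hat - rho0) / sqrt(rho0 * (1 - rho0) / n)

# Rejection rules for the three tests
reject_lower <- z <= qnorm(alpha)               # H0: rho >= rho0
reject_upper <- z >= qnorm(1 - alpha)           # H0: rho <= rho0
reject_two   <- abs(z) >= qnorm(1 - alpha / 2)  # H0: rho  = rho0

# p-value for the lower tail test
p_lower <- pnorm(z)

print(c(z = z, p_lower = p_lower))
print(c(lower = reject_lower, upper = reject_upper, two = reject_two))
```

Here the statistic is about −0.64, which is not extreme enough to reject any of the three null hypotheses at the 5 per cent level.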

Page 22: Point Estimate, Confidence Interval, Hypothesis tests

Sample size definition

The quality of a sample survey can be improved (worsened) by increasing (decreasing) the sample size.

The formulas below give the sample size needed for an interval estimate of the population mean or proportion at the (1 − α) confidence level, with margin of error E and a planned parameter estimate (σ or ρ).

Here, $z_{1-\alpha/2}$ is the 100(1 − α/2) percentile of the standard normal distribution.

- For the mean: $n = \frac{z_{1-\alpha/2}^2 \, \sigma^2}{E^2}$

- For the proportion: $n = \frac{z_{1-\alpha/2}^2 \, \rho (1 - \rho)}{E^2}$

Page 23: Point Estimate, Confidence Interval, Hypothesis tests

Sample size definition - Exercises

- Mean: Assume the population standard deviation σ of the student height in the survey is 9.48. Find the sample size needed to achieve a 1.2 centimeter margin of error at the 95 per cent confidence level.

Since there are two tails of the normal distribution, the 95 per cent confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, $z_{1-\alpha/2}$ is given by qnorm(.975).

- Proportion: Using a 50 per cent planned proportion estimate, find the sample size needed to achieve a 5 per cent margin of error for the female student survey at the 95 per cent confidence level.

Since there are two tails of the normal distribution, the 95 per cent confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, $z_{1-\alpha/2}$ is given by qnorm(.975).
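Both exercises reduce to plugging the stated values into the sample size formulas for the mean and the proportion. A minimal R sketch, with variable names chosen here for illustration:

```r
# Critical value for a 95 per cent confidence level
z <- qnorm(0.975)

# Mean: sigma = 9.48, margin of error E = 1.2
sigma  <- 9.48
E_mean <- 1.2
n_mean <- z^2 * sigma^2 / E_mean^2

# Proportion: planned estimate rho = 0.5, margin of error E = 0.05
rho    <- 0.5
E_prop <- 0.05
n_prop <- z^2 * rho * (1 - rho) / E_prop^2

# Sample sizes must be whole numbers, so round up
print(ceiling(c(mean = n_mean, proportion = n_prop)))
```

The raw values are about 239.75 and 384.15, so rounding up gives sample sizes of 240 and 385 respectively.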

Page 24: Point Estimate, Confidence Interval, Hypothesis tests

Homework

1: Confidence interval for the proportion. Suppose we have a sample of n = 25 births, 15 of which are female. Define the interval (at 99 per cent) for the proportion of females in the population. HINT: Apply the formula in slide 11, using the proper functions in R.

2: Hypothesis test to compare two proportions. Suppose we have two schools. Sampling from the first, n = 20 and the Hispanic students are 8. Sampling from the second, n = 18 and the Hispanic students are 4. Can we state (at 95 per cent) that the frequency of Hispanic students is the same in the two schools? N.B.: the test here is two-tailed.

The hypothesis test statistic here is:

$$z = \frac{\rho_1 - \rho_2}{sd}, \qquad sd = \sqrt{\rho (1 - \rho)\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}, \qquad \rho = \frac{\rho_1 n_1 + \rho_2 n_2}{n_1 + n_2}$$

Page 25: Point Estimate, Confidence Interval, Hypothesis tests

Charts - 1

Figure: Representation of the critical point for the upper tail hypothesis test

Page 26: Point Estimate, Confidence Interval, Hypothesis tests

Charts - 2

Figure: Representation of the critical point for the lower tail hypothesis test

Page 27: Point Estimate, Confidence Interval, Hypothesis tests

Charts - 3

Figure: Representation of the critical point for the two-tailed hypothesis test

Page 28: Point Estimate, Confidence Interval, Hypothesis tests

Charts - 4

Figure: Type I and Type II errors in hypothesis testing