Basic Statistics Introduction to Inferential Statistics

Basic Statistics

Introduction to Inferential Statistics

STRUCTURE OF STATISTICS

STATISTICS

• DESCRIPTIVE: TABULAR, GRAPHICAL, NUMERICAL

• INFERENTIAL: ESTIMATION, TESTS OF HYPOTHESIS

Introduction to Inferential Statistics

• Inferential statistics about the population mean are usually used to answer one of two types of questions.

– The first question is, “What is the average ‘something’?” This is Estimation.

• “Something” could be hours spent studying by online students, speed driven by teenagers, distance people commute to work or school, or any number of other things.

Introduction to Inferential Statistics

• The second type of question about the population mean is:

– “Am I right or wrong if I guess (hypothesize) the mean ‘something’ to be {value}?” This is Hypothesis Testing.

• Again, “something” could be hours spent studying by online students (10 hours), speed driven by teenagers (too fast*), distance people commute to work or school (12 miles), or any number of other things.

*The hypothesized value must be a value, not a value judgment!

Inferential Statistics

• Estimation: Confidence Intervals

• Hypothesis Testing: t-tests

Relating to the Textbook

• Your textbook treats these two types of questions as distinctly different, with Hypothesis Testing taking the predominant role.

• I see them as closely linked and in fact, I will show you how to do both things with one technique.

REMINDER!!

• Much of what we will cover from here until the end of the course is not in sequence with your book. The material is all there but I will be referring you to many sections of many chapters as we progress. You will need to pay careful attention to the PowerPoint lessons and be able to use your textbook as a reference.

Some Definitions for Estimation

• Estimation: Using sample statistics to estimate population parameters.

• Point Estimate: Use of a single number as the estimate for the unknown parameter (almost never exactly correct!).

• Interval Estimate: A range of values as the estimate for the unknown parameter.

• Confidence Interval: An interval estimate accompanied by a specific level of probability.

An Example of Estimation

Suppose a university administrator is interested in determining the average IQ of all professors at her university. It is too costly to test all professors, so she selects a random sample of 20 professors. Each is given an IQ test and the results show a sample mean of 135. Since the test is nationally standardized, she knows that σ for the population is 15.

How would the administrator estimate the average IQ for ALL university professors?

Constructing a Confidence Interval

• The general formula for a confidence interval uses information from the sample and our knowledge of the sampling distribution from the Central Limit Theorem.

• We then construct an interval in which we think the population parameter will be.

Confidence Interval Formula

$$CI = \bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$$

In words, the confidence interval is determined by adding and subtracting the bound on the estimate (the z score representing the level of confidence times the standard error) to and from the mean from the sample.
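A minimal Python sketch of this formula, assuming the population standard deviation σ is known. The function name `z_confidence_interval` and its parameter names are illustrative, not from the textbook; the numbers in the usage lines are those of the university administrator example.

```python
import math

def z_confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) for CI = mean ± z * sigma / sqrt(n)."""
    standard_error = sigma / math.sqrt(n)   # sigma / sqrt(n)
    bound = z * standard_error              # bound on the error of estimation
    return sample_mean - bound, sample_mean + bound

# Administrator example: sample mean 135, sigma 15, n = 20, 95% confidence
lower, upper = z_confidence_interval(135, 15, 20)
print(f"{lower:.1f} to {upper:.1f}")   # prints 128.4 to 141.6
```

Passing a different `z` (for example 2.575) gives the interval for another level of confidence.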

The Area Under the Normal Curve and the Sampling Distribution of the Means

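[Figure: normal curve of the sampling distribution of the means, centered on μ, with 95% of the area between μ − 1.96σ_X̄ and μ + 1.96σ_X̄.]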

The Sampling Distribution of the Means

• From the previous slide we can see that, given our knowledge of the sampling distribution of the means, we know that 95% of all X̄'s that we would obtain from numerous samples will fall within 1.96 standard errors (σ_X̄) of the unknown μ.

Our University Administrator

Let’s return to our university administrator and see how she will estimate the average IQ (μ) of all professors at her university.

She will need to know the Shape, Mean, and Standard Deviation of the Sampling Distribution!

What about the Shape of the Sampling Distribution?

If we had repeated taking hundreds of samples of 20 professors, what does the CLT tell us will be the shape of the distribution of sample means from these samples?

It would be approximately normal or mound-shaped.

What about the Mean of the Sampling Distribution?

What does the CLT tell us the mean of the sampling distribution would be?

It would be the same as the population mean, which is μ and is unknown.

What about the Standard Deviation of the Sampling Distribution?

What does the CLT tell us the standard deviation (standard error) of the sampling distribution would be?

It would be the same as the population standard deviation, 15, divided by the square root of the sample size, 20.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{20}} = 3.35$$

We can display this graphically

[Figure: sampling distribution of X̄ with one standard error, σ_X̄ = σ/√n = 15/√20 = 3.35, marked on each side of the mean.]

Using the standard deviation of the sampling distribution (standard error), we can determine a bound on the estimate.

We know that 95% of the observations in a normal distribution will fall within 1.96 standard deviations of the mean. Therefore, if we take 1.96 standard errors (1.96 × 3.35 = 6.57), we know the maximum distance by which our estimate will miss the population parameter (the error) 95% of the time.
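The bound depends on the sample size through the standard error. The sketch below, with an illustrative function name `bound_on_estimate`, computes the bound for the administrator's n = 20 and for two larger sample sizes that are hypothetical and not part of the example.

```python
import math

def bound_on_estimate(sigma, n, z=1.96):
    """Bound on the error of estimation: z standard errors."""
    return z * sigma / math.sqrt(n)

# n = 20 is the administrator's sample; the other sizes are hypothetical
for n in (20, 80, 320):
    print(n, round(bound_on_estimate(15, n), 2))
```

Because n sits under a square root, quadrupling the sample size only halves the bound.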

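Using this information, we can determine the points on this graph where the sample mean would occur 95% of the time:

$$\mu - (1.96 \times 3.35) = \mu - 6.57 \qquad \mu + (1.96 \times 3.35) = \mu + 6.57$$

[Figure: sampling distribution of the means centered at μ, with the central 95% lying between μ − 6.57 and μ + 6.57.]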

Let’s illustrate the computations--

First, we compute the bound on the error of estimation:

$$1.96\,\sigma_{\bar{X}} = 1.96 \frac{\sigma}{\sqrt{n}} = 1.96 \frac{15}{\sqrt{20}} = 6.57$$

We then subtract and add it to the sample mean:

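$$\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} = 135 - 1.96 \frac{15}{\sqrt{20}} = 128.4$$

$$\bar{X} + 1.96 \frac{\sigma}{\sqrt{n}} = 135 + 1.96 \frac{15}{\sqrt{20}} = 141.6$$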

The Answer!

• Based on our calculations, the way to state the estimate is:

The administrator is 95% confident that the mean IQ for all professors at her university is between 128.4 and 141.6.

We can show graphically the concept of the confidence interval.

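[Figure: sampling distribution with the interval from μ − 6.57 to μ + 6.57 marked, and a sample mean X̄ falling inside it.]

Since there is a 95% chance that the sample mean will be in this interval, the interval around the sample mean will capture the population mean (μ) 95% of the time.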

Important Concept

• When we construct a confidence interval, we are not saying that the parameter is in the middle but merely somewhere in that interval! It is like throwing a net into the sea: we hope to catch the fish, but we do not know where the fish is. If we are really hungry, we had better throw a big net (which, statistically, means choosing a higher degree of confidence).

How often will the 95% confidence interval capture μ?

Answer: 95% of the time

[Figure: the sampling distribution with μ ± 6.57 marked, and a sample mean X̄ with its own ± 6.57 interval drawn beneath it.]

Here, the sample mean is as far left as it will fall 95% of the time. Please note that it still captures (barely) the population mean.

Here, the sample mean is as far right as it will fall 95% of the time. Please note that it still captures (barely) the population mean.

Only 5 times in 100 samples will the obtained sample mean be so far away that a 95% confidence interval will not capture μ.
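The 95% capture rate can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with σ = 15, and the "true" mean of 130 is a made-up value used solely so the simulation has something to capture. The function name `coverage_rate` is illustrative.

```python
import math
import random

def coverage_rate(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of simulated samples whose z-interval contains mu."""
    rng = random.Random(seed)
    bound = z * sigma / math.sqrt(n)          # same bound for every sample
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        if sample_mean - bound <= mu <= sample_mean + bound:
            hits += 1
    return hits / trials

# mu = 130 is a hypothetical "true" mean used only for the simulation
rate = coverage_rate(mu=130, sigma=15, n=20, trials=10_000)
print(f"captured mu in {rate:.1%} of samples")
```

The printed rate should land close to 95%, with a small amount of simulation noise.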

Summary

The sample mean is the point estimate for the population mean.

The standard deviation of the sampling distribution is also called the standard error for the estimate of the mean.

1.96 standard errors provide the 95% bound on the error of estimation. If we add this bound to, and subtract it from, the sample mean, we create a confidence interval.

Finally, we can alter the confidence limits (from 95%) depending on the distance from the mean of the distribution that we choose.

Summary in Symbols for Estimating μ

$$\text{Point Estimate} = \bar{X}$$

$$\text{Error of Estimation} = \frac{\sigma}{\sqrt{n}}$$

$$\text{Bound on Error of Estimation} = z \frac{\sigma}{\sqrt{n}}$$

Note: z would be 1.96 for 95% Bound, 2.575 for 99% Bound, 1.645 for 90% Bound, etc.
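These z values need not be looked up by hand. As a sketch, Python's standard library can compute them from the standard normal distribution; the helper name `z_for_confidence` is illustrative. Note that the computed 99% value is 2.576, which some printed tables round or interpolate to 2.575.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z such that the central `level` of the standard normal lies within ±z."""
    tail = (1 - level) / 2                  # area left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```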

Confidence Interval:

$$\bar{X} - z \frac{\sigma}{\sqrt{n}} \quad \text{to} \quad \bar{X} + z \frac{\sigma}{\sqrt{n}}$$

One-Sample Test of Hypothesis

Hypothesis Testing on a Population Mean

Research Situation

A particular test has a national mean and standard deviation of 100 and 15, respectively. The superintendent of a particular school system wants to know if the average IQ in her school system is different from the national average on this test.

Definitions Related to Hypothesis Testing

• Null Hypothesis: The hypothesis that we will test statistically. In a single sample problem it is the “guess” about the population mean (μ).

– Written as: Ho: μ = value.

• Alternative Hypothesis: If the null is not feasible, then the alternative must be.

– Written as: Ha: μ ≠ ‘value’, or μ < ‘value’, or μ > ‘value’

Step by Step: The One-Sample Test of Hypothesis Using the z-test.

1. State Research Question

2. Establish the Hypotheses

3. Establish Level of Significance

4. Collect Data

5. Calculate Statistical Test

6. Interpret the Results

1. Stating the Research Problem

Is the average IQ of students in that particular system different from the national average?

[Figure: a sample with mean X̄ is drawn from the population with hypothesized mean μ0; is there a difference?]

2. Establish the Research Hypothesis

Null Hypothesis: The mean IQ is 100.

$$H_o: \mu = 100$$

Research or Alternative Hypothesis: The mean IQ is not equal to 100.

$$H_a: \mu \ne 100$$

Alternative or Research Hypotheses

• The Alternative Hypothesis may take a non-directional form, μ ≠ ‘value’.

• The Alternative Hypothesis may be a directional hypothesis, μ > ‘value’ or μ < ‘value’.

• The decision to use a directional alternative is based on the research question under investigation.

Errors in Decisions

Our Decision | Null is Really True | Null is Really False

Accept Null Hypothesis | Good Decision | Bad Decision: Type II Error (β)

Reject Null Hypothesis | Bad Decision: Type I Error (α) | Good Decision (Power)

3. Establish the Level of Significance

α is the probability of rejecting a true null hypothesis and will be equal to the area NOT within the area we would expect to find our sample mean (e.g., if we use 95% under the curve, then α is .05).

α defines what is called the “Rejection Region” because we will reject the Null if our calculated z statistic is in that region.

Graphical Depiction of Rejection Region

[Figure: sampling distribution centered on the hypothesized μ, with a rejection region in each tail.]

Rejection Region: Directional Hypotheses

This would represent a directional hypothesis, μ > ‘value’. The total area α would be on only one side, e.g., .05, thus the critical value of z would be +1.645 rather than 1.96, giving a greater likelihood of rejecting the Null Hypothesis.

Rejection Region: Directional Hypotheses

This would represent a directional hypothesis, μ < ‘value’. The total area α would be on only one side, e.g., .05, and again, the critical value of z would be −1.645 rather than −1.96, giving a greater likelihood of rejecting the Null Hypothesis.

Rejection Region

• It is determined by the Alpha (α) selected. α defines how much of the area under the curve will be in the rejection region.

• The probability of rejecting a TRUE Null Hypothesis is equal to the area in the rejection region, since a sample mean will only be obtained that frequently if the Null Hypothesis is true.
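The link between α, the direction of the alternative, and the critical z value can be sketched in Python. The function name `critical_z` and the labels "two-sided", "greater", and "less" are illustrative choices, not terminology from the textbook.

```python
from statistics import NormalDist

def critical_z(alpha, alternative="two-sided"):
    """Critical z value(s) bounding the rejection region."""
    std_normal = NormalDist()
    if alternative == "two-sided":      # Ha: mu != value, alpha split over both tails
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if alternative == "greater":        # Ha: mu > value, all of alpha in the upper tail
        return (std_normal.inv_cdf(1 - alpha),)
    if alternative == "less":           # Ha: mu < value, all of alpha in the lower tail
        return (std_normal.inv_cdf(alpha),)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

print([round(z, 3) for z in critical_z(0.05)])             # two tails
print([round(z, 3) for z in critical_z(0.05, "greater")])  # upper tail only
```

With α = .05 this reproduces the ±1.96 and 1.645 cutoffs used in these slides.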

4. Collecting the Data

A random sample of 81 students was given the IQ test and their scores were recorded. The mean of the sample was 105.

Student | IQ

1 | 109

2 | 88

... | ...

81 | 122

5. Analyzing the Data: Calculating the Test Statistic

The statistic we will calculate to determine if the Research Hypothesis is tenable is a modification of our z score.

$$Z = \frac{X - \bar{X}}{S} \quad\longrightarrow\quad Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$

Notice the new formula uses the data from the sampling distribution (the X̄'s) and the population mean μ, divided by the standard error. These are exactly as we discussed in Estimation.

Calculating the Z Statistic

• From our sample of 81 students, we calculated the sample mean to be 105.

• The population standard deviation is 15.

• Using our Z test formula we can determine where our sample mean would fall, if the population mean μ is 100.

The Z Statistic

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{105 - 100}{15 / \sqrt{81}} = \frac{5}{1.667} = 3.00$$

Thus, our sample mean lies 3.00 standard deviations (standard errors) above the population mean of 100.
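The same calculation as a short Python sketch, using the numbers from this example. The function name `one_sample_z` is illustrative, and the two-tailed cutoff at α = .05 is computed from the standard normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

z = one_sample_z(sample_mean=105, mu0=100, sigma=15, n=81)
critical = NormalDist().inv_cdf(0.975)   # two-tailed, alpha = .05 (about 1.96)
reject_null = abs(z) > critical          # is z inside the rejection region?
print(round(z, 2), reject_null)
```

Here z comes out at about 3, well beyond 1.96, so the null hypothesis is rejected.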

Locating our Mean on the Sampling Distribution of Means

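[Figure: sampling distribution of means centered at 100, with 98.33 and 101.67 marked one standard error (1.667) on either side, the central region labeled “95% of all means,” and the sample mean of 105 in the upper tail.]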

6. Interpreting the Results

• Since our mean is not in the area where we would expect 95% of all sample means from a distribution where the population mean is 100, we would reject the Null Hypothesis that μ = 100 and accept the alternative that it is different, μ ≠ 100.

• We would state that we reject the Null Hypothesis at the .05 level of significance.

Problems with Z Test

• The z test requires that we know the population standard deviation, which we usually do not know.

• The z test is designed for large samples (n > 30), which, again, we don’t always have.

• What is the solution?

Solution

Use a t-distribution rather than the z-distribution

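$$z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \qquad\longrightarrow\qquad t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$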

(See page 297-298)

Characteristics

z-distribution

• Mean of 0

• Mound-Shaped (Normal)

• SD is same as population except divided by the square root of n

t-distribution

• Mean of 0

• Mound-Shaped (Not exactly Normal)

• SD is same as the sample except divided by the square root of n

• Thus, t is more variable than z; how much depends on degrees of freedom

Understanding the t-Distribution

Recall the difference between the sample and population variances.

Population: $$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Sample: $$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Other than using sample numerical indicators rather than population numerical indicators, the only difference is that the sample variance is divided by “n - 1” rather than “N”.

The reason for this is that a sample variance computed by dividing by n tends to underestimate σ². A man named William S. Gosset discovered that dividing by n − 1 corrected this problem. He also discovered that n − 1 had greater significance in statistics, and it was called the degrees of freedom.
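The underestimation can be seen in a small simulation. This sketch assumes a normal population with σ = 15 (variance 225) and samples of n = 5; both values and the function name `average_variances` are hypothetical choices made only for illustration.

```python
import random

def average_variances(sigma, n, trials, seed=0):
    """Average the divide-by-n and divide-by-(n-1) variances over many samples."""
    rng = random.Random(seed)
    total_n = total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations
        total_n += ss / n
        total_n_minus_1 += ss / (n - 1)
    return total_n / trials, total_n_minus_1 / trials

# Hypothetical population: sigma = 15 (variance 225), samples of n = 5
by_n, by_n_minus_1 = average_variances(sigma=15, n=5, trials=20_000)
print(round(by_n), round(by_n_minus_1))
```

Averaged over many samples, dividing by n − 1 should land near the true variance of 225, while dividing by n falls noticeably short (around 180 for n = 5).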

William Gosset

William Gosset was the quality control engineer at the Guinness Brewery in Dublin in the early 1900s. For some reason, getting samples of 30 or more of his products proved difficult. This prompted his search for a small-sample statistic that resulted in his publication of the t-test. He published it under the pen name of “Student” for a couple of reasons. First, moonlighting was frowned upon by Guinness. Second, he wanted to honor his teachers, particularly Karl Pearson.

Review Degrees of Freedom

Degrees of freedom are the number of observations free to vary, thus the number of observations that contribute to the variance in a sample.

Recall the formula for the sample variance:

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

In order to compute the variance, we first must compute the sample mean. We find the sample mean by summing the scores and dividing by n.

$$\bar{X} = \frac{\sum X}{n}$$

Consider a problem with n = 3 and the sample mean = 5. Thus,

$$5 = \frac{\sum X}{3} = \frac{15}{3}$$

And if we know the first two numbers are 3 and 8,

$$\frac{3 + 8 + ?}{3} = \frac{15}{3}$$

The ? must be 4. Thus, only 2 of the 3 scores are free to vary. That is to say there are n − 1 degrees of freedom.

This concept of degrees of freedom is used for many different statistics, not just the t-statistic.
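The "last score is not free" idea can be written as a one-line calculation. The function name `last_score` is illustrative; the numbers are those of the n = 3 example.

```python
def last_score(sample_mean, n, known_scores):
    """Given the mean and n - 1 scores, the remaining score is determined."""
    if len(known_scores) != n - 1:
        raise ValueError("need exactly n - 1 known scores")
    return sample_mean * n - sum(known_scores)   # total minus what is known

print(last_score(sample_mean=5, n=3, known_scores=[3, 8]))  # prints 4
```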

The t-distribution is presented in tables, but not complete tables like the normal curve z-scores. This is because it would take a different table for each different degree of freedom. Thus, only commonly used alpha values are tabulated.

New Research Situation

The current rate for producing 5 amp fuses at Moe’s Electric Company is 250 per hour. A new machine has been purchased and installed that, according to the supplier, will increase the production rate. Is the new machine faster than the old one?

[Figure: a sample with mean X̄ drawn from a population whose mean is in question.]

Is there a significant difference in the mean score of the dependent variable between the sample and the population?

Step by Step: The One-Sample Test of Hypotheses using the t Test

1. State Research Question

2. Establish the Hypotheses

3. Establish Level of Significance

4. Collect Data

5. Calculate Statistical Test

6. Interpret the Results

1. Stating Research Problem:

Is the production rate of the new machine more than 250 per hour?

[Figure: a sample with mean X̄ is drawn from the population with hypothesized mean μ0; is there a difference?]

2. Setting Hypotheses

Null Hypothesis: The mean production rate of the new machine is equal to or less than 250 per hour. Notice that the null now contains the other side of the directional alternative!

$$H_o: \mu \le 250$$

Research or Alternative Hypothesis: The mean production rate is greater than 250 per hour.

$$H_a: \mu > 250$$

One-tail test

3. Setting your level of significance

[Figure: regions for “Rejecting H0” and “Accepting H0” under the sampling distribution for H0: μ0 = 250, shown for a two-tailed test of significance and for one-tailed tests of significance, with α = .05.]

Possible Conclusion

The sample is from a population with a mean greater than that of the null hypothesis

4. Collecting Data

Hours | Production

1 | 254

2 | 253

... | ...

10 | 250

A sample of 10 randomly selected hours from last month revealed the mean hourly production on the new machine was 256, with a sample standard deviation of 4.67 per hour.

5. Analyzing the Data: Calculating the Test Statistic

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$$

The t-test formula uses the data from the sampling distribution (the X̄'s) and the population mean μ, divided by the standard error, which is now defined by using the sample standard deviation and not the population standard deviation as required in the z test. These are the same as we discussed in Estimation.

Steps in Calculating the t Test

• From our sample of 10 randomly selected hours, we calculated the sample mean to be 256 and the standard deviation to be 4.67.

• Using our t test formula we can determine where our sample mean would fall, if the population mean is 250.

Calculating the t statistic

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} = \frac{256 - 250}{4.67 / \sqrt{10}} = \frac{6}{1.48} = 4.06$$

Thus, our obtained mean from our sample is 4.06 standard errors above the hypothesized mean of 250. Would this value be in our rejection region?

YES!
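A short Python sketch of this t calculation, using the sample values from the fuse example. The function name `one_sample_t` is illustrative. The standard library has no t-distribution, so the critical value 1.833 is hard-coded; it is the table value these slides quote for a one-tailed test at .05 with 9 degrees of freedom.

```python
import math

def one_sample_t(sample_mean, mu0, s, n):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n)), with n - 1 df."""
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu0) / standard_error, n - 1

t_stat, df = one_sample_t(sample_mean=256, mu0=250, s=4.67, n=10)
T_CRITICAL = 1.833   # one-tailed, alpha = .05, 9 df (table value quoted in the slides)
reject_null = t_stat > T_CRITICAL   # upper-tail test: Ha is mu > 250
print(round(t_stat, 2), df, reject_null)
```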

Graphical representation of our calculations

[Figure: sampling distribution of X̄ under H0: μ0 = 250 with the one-tailed rejection region (α = .05) in the upper tail; the sample mean of 256 lies t = 4.06 standard errors above 250, inside the region for rejecting H0 in favor of H1: μ1 > 250.]

6. Interpreting the Results

• Since our sample mean is not in the area where we would expect 95% of all sample means from a distribution where the population mean is 250 and is in our rejection region, we would reject the Null Hypothesis that μ ≤ 250 and accept the alternative that μ > 250.

• We would state that we reject the Null Hypothesis at the .05 level of significance.

Assumption Required

Since the n is too small to invoke the Central Limit Theorem, we can no longer be sure that the sampling distribution is normal. In fact, we have already learned that it is not; it is a t-distribution. In order for this to happen, we must assume that the original population is normally distributed.

The smaller the n, the more important this assumption is. For example, with an n of 4, normality might be important. However, by the time n is 30 or more, the normality assumption is not necessary as the CLT takes over. We can see this by examining a t-table (Table D).

Assumption Summary

For a one-sample z-test, there is only one assumption:

● Sample was obtained randomly

For a one-sample t-test, there are two assumptions:

● Sample was obtained randomly

● Original population is normally distributed

Comparing Estimation and Hypothesis Testing

• In estimation, we use data from a sample to estimate, with a declared level of confidence, where we think the population mean is.

• In hypothesis testing, we use data from a sample to evaluate the acceptability of the hypothesized value for the population mean.

Comparing the Two Formulae

$$CI = \bar{X} \pm t \frac{s}{\sqrt{n}} \qquad t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

Notice that I have changed the earlier confidence interval formula to use the information from the t distribution. Also note the same standard error is used in both.

Calculating the t Statistic and the 95% Confidence Interval

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} = \frac{256 - 250}{4.67 / \sqrt{10}} = \frac{6}{1.48} = 4.06$$

$$95\%\ CI = \bar{X} \pm t \frac{s}{\sqrt{n}} = 256 \pm 1.833^{*}(1.48)$$

*The value of 1.833 (called the critical value) is found in Table D, page 538, using one-tailed .05 and 9 degrees of freedom. We must remember that our alternative was that μ was greater than 250, which it was.

Comparing the 95% Confidence Interval and the t-test.

95% CI: 256 − 2.7 = 253.3 and 256 + 2.7 = 258.7.

We are 95% confident that the population mean is between 253.3 and 258.7, based on the data from our random sample.

The calculated t statistic was 4.06, meaning that the obtained mean is 4.06 standard errors above the hypothesized mean of 250, and therefore we rejected the null of 250.

A Graphical Comparison

[Figure: number line showing the hypothesized mean of 250, the rejection region above it, the obtained t = 4.06, and the 95% confidence interval from 253.3 to 258.7.]

Arriving at the Same Conclusion

• Notice that the confidence interval does not contain the hypothesized value [250], thus it is a bad guess.

• General rule: If the confidence interval does not contain the hypothesized value we will reject the null hypothesis, the same as if we had calculated the t statistic.

• Thus, we can conduct a test of hypothesis simply by calculating the confidence interval.
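The general rule can be sketched in Python with the fuse example. The function names `t_interval` and `reject_by_interval` are illustrative. The critical value 1.833 is the one-tailed table value used in these slides; a conventional two-sided 95% interval with 9 degrees of freedom would use a larger table value (2.262) and so be somewhat wider, which is an observation added here and not something the slides state.

```python
import math

def t_interval(sample_mean, s, n, t_critical):
    """CI = sample mean ± t * s / sqrt(n)."""
    bound = t_critical * s / math.sqrt(n)
    return sample_mean - bound, sample_mean + bound

def reject_by_interval(interval, hypothesized):
    """Reject the null when the hypothesized value lies outside the interval."""
    lower, upper = interval
    return not (lower <= hypothesized <= upper)

# 1.833 is the table value quoted in the slides (one-tailed .05, 9 df)
ci = t_interval(sample_mean=256, s=4.67, n=10, t_critical=1.833)
print(tuple(round(v, 1) for v in ci), reject_by_interval(ci, 250))
```

The interval works out to about 253.3 to 258.7, and since 250 falls outside it, the sketch reaches the same decision as the t statistic.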