Statistics for the Business Majors: Lecture notes
by
Stephen Jay Silver, Ph.D.
Professor of Business Administration
The Citadel
This is a draft of a text that will be made available online in the very near future.
Please do not distribute this outside the confines of The Citadel.
2
Introduction and Preface
The content of this manual represents the essential elements of statistical theory and its
applications through problems and examples. What makes this manual unique is not so
much the content itself as the supplemental material made available to students using the
manual. These include a series of videos of studio lessons that I developed in the spring
and summer of 2008. Hyperlinks to these lecture videos appear at the end of this
preface/introduction.
In addition to what appears in this manual and the videos, underlying proofs (which are
not essential to the understanding of the material, but are presented parenthetically for the
“purists” in the class), problem sets and demonstrations of concepts are linked to the text
itself.
Also, over the years I have created an Excel Statistical Template that anyone can use to
reduce the amount of calculations, and thereby the chance for error, when applying
formulas to real-world problems. I have allowed smaller sections of my statistics classes to
use the template to work problems on exams in order to avoid computational errors and
shorten the time required for the tests.
The URL for this application is http://faculty.citadel.edu/silver/Statistics Template.xls.
When first queried whether to save or open the file, elect to save it and then open it.
Afterward you may save the file to your hard drive for later use.
The lecture notes themselves are quite condensed and do not make for light bedtime
reading; however, the videos should go a long way toward illuminating many of the points
made in the notes. So students who have difficulty understanding the material contained in
the notes are encouraged to view these videos, whose hyperlinks are shown on the next
page.
I should also mention that an alternative set of videos developed for my BADM 710 MBA
class, including topics other than statistics, is also available. I begin that course with
several lectures on optimization and the calculus. Hyperlinks to these lecture videos can be
found at my website at http://faculty.citadel.edu/silver. Click on BADM 710 and hyperlinks
can be found on that syllabus. In fact, the first statistics lecture video in the BADM 710
notes begins with my going over a problem that students may wish to ignore; they should
wait for the video to load fully and fast forward over this material.
Finally, a separate set of probability lecture notes may be included in the course; these are
found at http://faculty.citadel.edu/silver/ba604/prob_theory.pdf. Hyperlinks to the lecture
videos are given at the top of the notes as well as in the links just below this introduction.
Statistics Videos to accompany Professor Silver’s Statistics Lecture Notes
BADM 205 Business Statistics Online Video Lessons
Lesson 1 Part 1 Introduction to Statistics
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 01-part 1.wmv
Lesson 1 Part 2 Descriptive Statistics I
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 01-part 2.wmv
Lesson 2 Descriptive Statistics II Algebra of Summation
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 02.wmv
Lesson 3 Descriptive Statistics III Relative Location
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 03.wmv
Lesson 4 Probability I
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 04.wmv
Lesson 5 Probability II Conditional Probability, Independence, Bayes
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 05.wmv
Lesson 6 Discrete Probability Distributions I
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 06.wmv
Lesson 7 Discrete Probability Distributions II Binomial and approximations
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 07.wmv
Lesson 8 Continuous Random Variables
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 08.wmv
Lesson 9 The Central Limit Theorem and confidence intervals
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 09.wmv
Lesson 10 The Hypothesis Test of the population mean and proportion
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 10.wmv
Lesson 11 The Hypothesis Test Differences in means and proportions
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 11.wmv
Lesson 12 The Analysis of Variance Test for Differences in means of several populations
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 12.wmv
Lesson 13 The χ2 Test of Independence
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 13.wmv
Lesson 14 Simple OLS Regression
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 14.wmv
Project 1 Descriptive Statistics
http://faculty.citadel.edu/silver/BA205/Online course/Project 01.wmv
Project 2 Demonstration of the Central Limit Theorem
http://faculty.citadel.edu/silver/BA205/Online course/Project 02.wmv
Project 3 Condo Case, Multiple OLS Regression model
http://faculty.citadel.edu/silver/BA205/Online course/Project 03.wmv
Statistics Lecture Notes – Part 1 Probability Distributions and Density Functions
I. Population, data and random variables
A population is selected and data are chosen from that population. The shape of the
population is called its distribution. A particular outcome of a drawing from a
random variable, or generator of values, is called a random variate. A random
variable does not refer to a single outcome of the sampling but to the process by
which outcomes occur.
Each distribution has one or more values, called the parameters of the
distribution. For example, the mean, median and the standard deviation are
parameters of a distribution.
Common distributions may be either continuous, such as the normal and
exponential, or discrete, such as the Poisson and the Binomial. [These will all be
covered later.]
Other distributions may not be of a standard form, but may be defined by a
listing of values and associated probabilities.
Often we do not know the distribution of the population and may draw values
from it randomly, called random sampling, to try to describe the distribution
generating the values and to estimate the distribution’s parameters.
A random sample is one in which every element within the population has an
equal chance of being selected.
II. Location and spread; two important measures of a distribution
The mean is the average value of the distribution; we denote the population mean
by the lowercase Greek letter mu, μ.
Other measures of central location are the mode and the median. The median is
the value that half the population exceeds and the other half is less than. The mode
is the most frequent value, in the case of a discrete distribution, or the highest point
on the frequency plot of a continuous distribution.
For some populations the median or mode is a better measure of location than the
mean. For example, the most common color or make of automobile – the mode –
makes more sense than the mean or median. In fact, with qualitative data such as
color, model, or gender, the other measures make no sense. And where outliers may
skew the data severely, such as income or housing prices, we usually give the median
value rather than the mean.
Measures of spread include the range or the distance between the lowest and
highest values of the distribution; the variance, or the average squared deviation from
the mean; and the standard deviation, which is the square root of the variance.
We write the mean as the expected value of the random variable X, or E(X). The
variance can therefore be written as σ² = E[(X – μ)²], and σ = √(E[(X – μ)²]).
For any population, Chebyshev proved that the maximum fraction of the
population, for any distribution, that can lie more than kσ away from the
population mean is 1/k² for k > 1; thus, at least 1 – 1/k² must lie within kσ of μ.
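Chebyshev's bound can be checked numerically; here is a minimal Python sketch (not part of the notes' Excel template; the data are invented purely for illustration):

```python
import statistics

# Hypothetical heavy-tailed data set, chosen to stress the bound.
data = [0] * 90 + [50] * 5 + [-50] * 5

mu = statistics.mean(data)        # mean of this data set
sigma = statistics.pstdev(data)   # population-style (divisor n) standard deviation

def frac_beyond(k):
    """Fraction of the data lying more than k*sigma from the mean."""
    return sum(abs(x - mu) > k * sigma for x in data) / len(data)

# Chebyshev: for any data set, the fraction beyond k*sigma is at most 1/k^2.
for k in (1.5, 2, 3):
    assert frac_beyond(k) <= 1 / k ** 2
```

The bound applies to the empirical distribution of any sample, which is why no distributional assumption is needed here.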
The Empirical Rule for mound-shaped populations relies on the areas under the
normal distribution curve, the so-called “bell curve”: about 2/3 will be within one σ,
about 95% within two σ, and almost the entire population within three σ from the
mean.
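The Empirical Rule can likewise be verified by simulation; a small Python sketch, assuming a normal population with μ = 100 and σ = 15 (the IQ parameters used later in these notes):

```python
import random

random.seed(1)
mu, sigma = 100, 15
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

def frac_within(k):
    """Fraction of the simulated population within k*sigma of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in sample) / len(sample)

# Empirical Rule: roughly 68%, 95%, and 99.7% within 1, 2, and 3 sigma.
assert 0.66 < frac_within(1) < 0.70
assert 0.94 < frac_within(2) < 0.96
assert frac_within(3) > 0.99
```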
III. Using data from random samples to estimate the mean and variance.
Let us draw a random sample from a population. This means that each element
in the population has an equal chance of being selected. Call these observations x1,
x2, …, xn. Then we estimate the mean and variance as follows.
i) x̄ = (x1 + x2 + … + xn)/n = x1/n + x2/n + … + xn/n. This can also be
written as x̄ = ∑x/n [we omit that we are adding up n x’s, as this is
understood. To refresh yourself on summation notation, go to the algebra
of summation page]
ii) s² = [(x1 – x̄)² + (x2 – x̄)² + … + (xn – x̄)²]/(n – 1)
= ∑(xi – x̄)²/(n – 1). So it is approximately the average of the squared
deviations from the mean.
The reason we divide by n – 1 instead of n is that the estimated result would be too
small on average, because we use our estimate of the population mean, x̄, instead of
the actual value μ, which is unknown. It can be shown that dividing by
n – 1 gives an unbiased estimate, meaning that on average our estimated variance equals
the true variance; that is, E(s²) = σ². Two important characteristics of the variance
estimate s² are its unbiasedness as an estimator of σ², and that using x̄ minimizes the
sum of squared deviations, which is the numerator of s². [Proof] An alternative
formula for s² is (∑x² – n·x̄²)/(n – 1). [Proof]
Also, it is interesting to understand the function of the median as a measure of
location and when it is more appropriate to use the median rather than the mean.
Let us look at the following statistic, called the mean absolute deviation. Suppose
we have a measure of centrality x* and we take the absolute differences of our
sample data from that point. The sum of these absolute differences is Σ|x – x*| and
the mean absolute difference, or MAD, is the sum above divided by n. The median
minimizes MAD. [Proof]
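That the median minimizes MAD can also be checked numerically; a sketch with a deliberately skewed, hypothetical sample:

```python
import statistics

x = [1, 2, 3, 4, 100]   # hypothetical sample with one large outlier
n = len(x)

def mad(center):
    """Mean absolute deviation of the sample from a candidate center x*."""
    return sum(abs(xi - center) for xi in x) / n

med = statistics.median(x)   # 3
mean = statistics.mean(x)    # 22

# The median minimizes MAD; in particular it beats the mean and other candidates.
assert mad(med) <= mad(mean)
assert all(mad(med) <= mad(c) for c in [0, 1, 2, 2.5, 3.5, 10, 50])
```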
IV. The mean and standard deviation of a generic probability distribution
The table below is an example of a probability distribution. The values of X are
the number of absences per semester of my BADM 710 students. Thus, 40% of my
students miss none of my classes, 30% miss exactly one class, etc.
X P(X)
0 .4
1 .3
2 .2
3 .1
6
Note that the P(X) are all non-negative (for all X’s that do not occur, we assume
P(X) = 0), and ∑P(X) = 1. Any distribution having these characteristics is said to
be a probability distribution or PD.
The mean of a PD is the weighted average of the X values, where the weights are
the respective probabilities. Thus, for the above distribution, μ = ∑X·P(X) = 0(.4)
+ 1(.3) + 2(.2) + 3(.1) = 1. [See the table below for calculating the mean and variance]
The variance of a distribution is the weighted average of the squared differences
from the mean; thus, σ² = ∑(X – μ)²·P(X) = (–1)²(.4) + 0²(.3) + 1²(.2) + 2²(.1)
= .4 + 0 + .2 + .4 = 1. So σ = 1.
X     P(X)   X·P(X)   (X–μ)²·P(X)         X²·P(X)
0     .4     0        (0–1)²(.4) = .4     0
1     .3     .3       (1–1)²(.3) = 0      .3
2     .2     .4       (2–1)²(.2) = .2     .8
3     .1     .3       (3–1)²(.1) = .4     .9
SUM   1      1        1                   2
Another formula for σ² is ∑X²·P(X) – μ² = 2 – 1² = 1. [Proof]
V. Two simple distributions
Uniform distribution. This distribution may be either discrete or continuous.
For both the discrete and continuous uniform distributions, μ = (a+b)/2, where a
is the smallest value and b is the largest; for any symmetric distribution the mean
and median are the same value.
[Figure: the discrete uniform distribution P(X) and the continuous uniform distribution f(X), each plotted for X = 1 to 10.]
For the discrete uniform distribution σ² = d²(n² – 1)/12, where d is the distance
between successive values and n is the number of x values; for the continuous
uniform distribution σ² = (b – a)²/12. [Proofs]
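A quick numerical check of the discrete uniform variance formula, using the values 1 through 10 (so d = 1 and n = 10):

```python
# Discrete uniform on 1..10.
values = list(range(1, 11))
n, d = len(values), 1
mu = sum(values) / n                         # (a + b)/2 = 5.5
var_direct = sum((v - mu) ** 2 for v in values) / n

var_formula = d ** 2 * (n ** 2 - 1) / 12     # = 99/12 = 8.25
assert abs(mu - (1 + 10) / 2) < 1e-12
assert abs(var_direct - var_formula) < 1e-9
```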
VI. A very important continuous distribution is the normal distribution, or bell
curve.
The probability density function (pdf) of the normal distribution is
f(x) = [1/(σ√(2π))]·e^(–.5[(x – μ)/σ]²).
While this looks ugly, it actually has nice properties. For example, whether x – μ is
positive or negative, when squared it has the same value; thus, the distribution is
symmetric around x = μ, where [(x – μ)/σ]² is minimized, but since we take e to the
negative power, f(x) is maximized at the mean.
Also, it can be shown that the function f(x) has a slope (derivative) at all real x,
so it cannot come to a point, but is smooth. Finally, since the value of f(x), which is a
power of the positive number e times a positive constant, must always be positive, f(x)
never crosses the x axis.
All these together imply a bell-shaped distribution. The positive constant term in
front, 1/(σ√(2π)), merely standardizes the area under the curve to be 1. The only
other issue is why the .5 in the exponent. Well, it just is. In future we will indicate
that a random variable X is normally distributed with mean μ and standard
deviation σ as follows: X ~ N(μ, σ).
To find probabilities we standardize the normal using z-scores, where
Z = (x – μ)/σ is the number of standard deviations x is from the mean.
IQ tests are normally distributed and have a mean of 100 points and a standard
deviation of 15 points. That is, the average child has a 100 IQ and on average a
child’s IQ differs from the mean by 15 points.
[Figure: the standard normal distribution, f(X) for X from –3 to 3.]
Z-scores and using the table: find P(IQ > 115). Since 115 differs from the mean
by one standard deviation (the z-score of 115 = (115 – 100)/15 = 1), we look on the
normal table for 1.00. We see the area is .3413, which means that 34.13% of the
population lies between Z = 0 and Z = 1. Since half the population lies to the right of
the mean (it’s symmetric), .5 – .3413 = .1587, or 15.87% of children have IQs above
115 (and, by symmetry, 15.87% below 85).
a. Find the following probabilities
P(IQ < 105) (.5 + .1293 = .6293)
P(IQ < 80) (.5 – .4082 = .0918)
P(110 < IQ < 130) (.4772 – .2486 = .2286)
P(80 < IQ < 115) (.4082 + .3413 = .7495)
P(80 < IQ < 90) (.4082 – .2486 = .1596)
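These table answers can be reproduced (to table-rounding accuracy) with the standard normal CDF, written here in Python via the error function; the exact values differ slightly from those above because the table rounds z to two decimals:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15   # IQ parameters from the example

def p_below(iq):
    return phi((iq - mu) / sigma)

# Compare with the table answers above, allowing for rounding of z.
assert abs(p_below(105) - .6293) < .005
assert abs(p_below(80) - .0918) < .005
assert abs((p_below(130) - p_below(110)) - .2286) < .005
assert abs((p_below(115) - p_below(80)) - .7495) < .005
assert abs((p_below(90) - p_below(80)) - .1596) < .005
```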
VII. Continuous distribution vs. discrete distribution mean and variance
Again, the mean of a discrete PD is μ = ∑X·P(X) and the variance
σ² = ∑(X – μ)²·P(X).
The equivalent for a continuous distribution is to substitute f(x)dx for P(X) in
both formulas and to use the integral instead of the summation; thus,
μ = ∫x·f(x)dx and σ² = ∫(x – μ)²·f(x)dx.
Another PD example:
a. Let P(X = k) = (1/2)^k, k = 1, 2, 3, …
b. ∑P(X) = 1 and all P(X) ≥ 0; thus P(X) is a probability distribution.
c. Find μ and σ. (Don’t try it by hand; it’s not for the faint of heart. Use an
Excel spreadsheet to estimate them to three decimal places.) We see that
μ = 2 and σ² = 6 – 2² = 2. For a formal proof, click HERE.
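In place of the Excel estimate suggested above, a Python sketch summing the series numerically (the tail is negligible well before k = 200):

```python
# P(X = k) = (1/2)^k for k = 1, 2, 3, ...
ks = range(1, 200)
total = sum(0.5 ** k for k in ks)
mu = sum(k * 0.5 ** k for k in ks)            # E[X]
ex2 = sum(k ** 2 * 0.5 ** k for k in ks)      # E[X^2]
var = ex2 - mu ** 2

assert abs(total - 1) < 1e-9   # a valid probability distribution
assert abs(mu - 2) < 1e-9      # mu = 2
assert abs(var - 2) < 1e-9     # sigma^2 = 6 - 2^2 = 2
```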
VIII. The Poisson Distribution
P(X = k) = e^(–λ)·λ^k/k!; μ = λ, σ² = λ (See Excel estimates for λ = 1 and λ = 2)
Example for λ = 3 using the Poisson table below
k P(X=k) X*P(X) (X-3)^2*P(X) CumP(X)
0 0.0498 0.0000 0.4481 0.0498
1 0.1494 0.1494 0.5974 0.1991
2 0.2240 0.4481 0.2240 0.4232
3 0.2240 0.6721 0.0000 0.6472
4 0.1680 0.6721 0.1680 0.8153
5 0.1008 0.5041 0.4033 0.9161
6 0.0504 0.3025 0.4537 0.9665
7 0.0216 0.1512 0.3457 0.9881
8 0.0081 0.0648 0.2025 0.9962
9 0.0027 0.0243 0.0972 0.9989
SUM 0.9989 2.9886 2.9400
Poisson distribution with λ = 3: as further terms (k = 10, 11, …) are included,
∑X·P(X) approaches μ = 3 and ∑(X – 3)²·P(X) approaches σ² = 3.
Uses of the Poisson distribution: Cars at a toll plaza; customers at a bank; phone
calls at a switchboard.
i) If three cars per minute arrive at a toll plaza, and arrivals are
Poisson-distributed, use the table above to find the probability
that during the next minute
no cars arrive (P(k = 0) = .0498)
exactly two arrive (P(k = 2) = .2240)
more than five arrive (1 – P(k ≤ 5)) = 1 - .9161 = .0839)
between two and four cars, inclusive, arrive
(P(k ≤ 4) – P(k ≤ 1) = .8153 – .1991 = .6162)
ii) Approximation of the Binomial distribution, which we discuss in
the next topic.
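The toll-plaza answers can be verified directly from the Poisson formula; a short Python sketch:

```python
import math

def pois_pmf(k, lam):
    """Poisson probability P(X = k) = e^(-lambda) * lambda^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pois_cdf(k, lam):
    """Cumulative probability P(X <= k)."""
    return sum(pois_pmf(i, lam) for i in range(k + 1))

lam = 3  # three cars per minute
assert abs(pois_pmf(0, lam) - .0498) < 5e-4                        # no cars
assert abs(pois_pmf(2, lam) - .2240) < 5e-4                        # exactly two
assert abs((1 - pois_cdf(5, lam)) - .0839) < 5e-4                  # more than five
assert abs((pois_cdf(4, lam) - pois_cdf(1, lam)) - .6162) < 5e-4   # two to four
```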
IX. The Binomial Distribution [A related topic for sampling from finite
populations is covered in the Hypergeometric Distribution]
The binomial distribution is used when only one of two possible outcomes will
occur on a trial of an experiment conducted from an infinite population; thus, two
values are possible. We assign the values of 1 for a “success” and 0 for a “failure”.
Thus:
“Success” with probability p; X = 1 and “Failure” with prob. 1 - p; X = 0.
Examples are tossing a fair coin, finding defects when sampling relatively few parts
in a factory, etc. The parameters are μ = p, the probability of a success, and
σ² = p(1 – p). (See the table below)
X     P(X)   X·P(X)   (X–p)²·P(X)
0     1–p    0        (–p)²(1–p) = p²(1–p)
1     p      p        (1–p)²·p
SUM   1      p        p(1–p)
For the binomial distribution for n trials, let Y = ∑X; thus, Y = the number of
successes in the n trials. Statistically, it can be shown that E(∑X) = ∑E(X) = np = μ.
And, if the trials are independent, Var(∑X) = ∑Var(X) = nσ² = np(1 – p). The
probability of k successes in n trials is given by P(Y = k) = C(n, k)·p^k·(1 – p)^(n–k),
where C(n, k) = n!/[k!(n – k)!] is the number of ways you can obtain k successes in n
independent trials; the rest of the expression is the probability of each way. These
calculations will be unpleasant to do for fairly large n and k, so tables for particular
values are available in most textbooks.
Example: for n = 20, p = .6, see the graph and cumulative table below for the
probabilities for this binomial distribution. Use the table to find the following
probabilities for the number of successes in 20 trials:
i) exactly ten (Hint: P(X ≤ 10) – P(X ≤ 9)) (.1171)
ii) more than thirteen (Hint: 1 – P(X ≤ 13)) (.2500)
iii) between ten and fifteen, inclusive (.8215)
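The same three answers follow from the binomial formula; in this Python sketch, math.comb supplies C(n, k):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability P(Y = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """Cumulative probability P(Y <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 20, .6
assert abs(binom_pmf(10, n, p) - .1171) < 5e-4                        # exactly ten
assert abs((1 - binom_cdf(13, n, p)) - .2500) < 5e-4                  # more than 13
assert abs((binom_cdf(15, n, p) - binom_cdf(9, n, p)) - .8215) < 5e-4 # ten to fifteen
```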
[Figure: the binomial distribution, P(X = k) for k = 0 to 20, with p = .6, n = 20.]
Binomial Distribution; p = .6, n = 20
k    P(X = k)    P(X ≤ k)
0 0.0000 0.0000
1 0.0000 0.0000
2 0.0000 0.0000
3 0.0000 0.0000
4 0.0003 0.0003
5 0.0013 0.0016
6 0.0049 0.0065
7 0.0146 0.0210
8 0.0355 0.0565
9 0.0710 0.1275
10 0.1171 0.2447
11 0.1597 0.4044
12 0.1797 0.5841
13 0.1659 0.7500
14 0.1244 0.8744
15 0.0746 0.9490
16 0.0350 0.9840
17 0.0123 0.9964
18 0.0031 0.9995
19 0.0005 1.0000
20 0.0000 1.0000
From the discussion above, we see that we can now use four different methods to
calculate binomial probabilities: the binomial formula; the binomial tables (or Excel
for n or p not listed in the tables); the Poisson (for large n and small p); and the
normal (whenever np and n(1-p) are both ≥ 5).
We have developed an Excel spreadsheet that will allow the user to determine
which method to use for a given set of n and p. The sheet is found in the Statistical
Template; go to the sheet titled “Three Distributions”. Below is the plot
for n = 1000 and p = .01. We see that the preferred approximation technique is the
Poisson, whose probability values track the binomial probabilities almost perfectly
for all values of X.
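The comparison on the “Three Distributions” sheet can be reproduced in a short Python sketch; here the normal column is built with the continuity correction (an assumption about how the sheet computes it):

```python
import math

n, p = 1000, 0.01
lam = n * p                             # 10
sigma = math.sqrt(n * p * (1 - p))      # sd of the binomial

def binom(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson(k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal(k):
    # Continuity-corrected normal approximation of P(X = k).
    return phi((k + .5 - lam) / sigma) - phi((k - .5 - lam) / sigma)

# Total absolute error of each approximation over k = 0..19.
ks = range(20)
err_pois = sum(abs(binom(k) - poisson(k)) for k in ks)
err_norm = sum(abs(binom(k) - normal(k)) for k in ks)
assert err_pois < err_norm   # the Poisson tracks the binomial far more closely
```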
X. One more distribution: the (negative) exponential distribution
This distribution is related to the Poisson; it describes the time between successes
rather than the number of successes. It is the continuous counterpart of the Poisson,
which counts occurrences per period of time.
As time is continuous, the exponential distribution is continuous: f(t) = λe^(–λt),
t ≥ 0, where λ > 0; λ is the same parameter as for the Poisson, that is, the mean
number of successes per time period. The mean is 1/λ, and P(Time > t) = e^(–λt).
Example: on average customers arrive at a bank every two minutes, that is, the
mean number of arrivals per ten minute period is five.
[The table and plot below belong to the “Three Distributions” comparison discussed above: the binomial with p = 1.00E-02, n = 1000 (np = 10), and the approximating Poisson and Normal distributions.]
k    Binomial       Poisson     Normal
-2   0              0           9.31E-05
-1   0              0           0.000295
0    4.31712E-05    4.54E-05    0.000844
1    0.000436073    0.000454    0.002185
2    0.002200188    0.00227     0.005119
3    0.007393223    0.007567    0.010851
4    0.018613745    0.018917    0.020809
5    0.037453112    0.037833    0.0361
6    0.062737115    0.063055    0.056658
7    0.089986568    0.090079    0.080448
8    0.112824069    0.112599    0.10334
9    0.125613329    0.12511     0.120093
10   0.125740211    0.12511     0.126261
11   0.114309283    0.113736    0.120093
12   0.095161516    0.09478     0.10334
13   0.073053285    0.072908    0.080448
14   0.052022794    0.052077    0.056658
15   0.034541734    0.034718    0.0361
16   0.02147955     0.021699    0.020809
17   0.012558454    0.012764    0.010851
18   0.006927587    0.007091    0.005119
19   0.003616635    0.003732    0.002185
[Figure: plot of the three distributions for k = –2 to 18.]
From the table below find the following probabilities for the time between the
next two arrivals
i) more than half a minute (Hint: half a minute = .5/10 = .05 ten-minute
periods, and λ = 5 per 10-minute period) (.7788)
ii) less than one minute (.3935)
iii) between one and two minutes (.6065 – .3679 = .2386)
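The three exponential answers can be checked directly from P(T > t) = e^(–λt), with time measured in 10-minute units:

```python
import math

lam = 5  # mean arrivals per 10-minute period, so t is in 10-minute units

def p_greater(t):
    """P(T > t) = e^(-lambda * t) for the exponential distribution."""
    return math.exp(-lam * t)

assert abs(p_greater(0.05) - .7788) < 5e-4                       # more than half a minute
assert abs((1 - p_greater(0.10)) - .3935) < 5e-4                 # less than one minute
assert abs((p_greater(0.10) - p_greater(0.20)) - .2386) < 5e-4   # one to two minutes
```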
[Figure: the exponential density f(t), for t from 0 to 1.1, with λ = 5.]
Exponential Distribution; λ = 5
t      f(t)    dt·f(t)   P(T > t) = e^(–λt)
0.00 5.000 0.2224 1.0000
0.05 3.894 0.1732 0.7788
0.10 3.033 0.1349 0.6065
0.15 2.362 0.1050 0.4724
0.20 1.839 0.0818 0.3679
0.25 1.433 0.0637 0.2865
0.30 1.116 0.0496 0.2231
0.35 0.869 0.0386 0.1738
0.40 0.677 0.0301 0.1353
0.45 0.527 0.0234 0.1054
0.50 0.410 0.0183 0.0821
0.55 0.320 0.0142 0.0639
0.60 0.249 0.0111 0.0498
0.65 0.194 0.0086 0.0388
0.70 0.151 0.0067 0.0302
0.75 0.118 0.0052 0.0235
0.80 0.092 0.0041 0.0183
0.85 0.071 0.0032 0.0143
0.90 0.056 0.0025 0.0111
0.95 0.043 0.0019 0.0087
1.00 0.034 0.0015 0.0067
1.05 0.026 0.0012 0.0052
1.10 0.020 0.0009 0.0041
1.15 0.016 0.0004 0.0032
Total P 1.0024
Statistics Lecture Notes – Part 2 Inferential Statistics of one variable
In this section of the Notes we will address the most important use of statistics:
making inferences about the populations being investigated. We begin by discussing
the central limit theorem and its applications in statistical inference.
I. The Central Limit Theorem (CLT)
This is without doubt the most important theorem in statistics – I like to refer to it
as the fundamental theorem of statistics – and almost all of what we do in statistics
is based on the CLT.
The central limit theorem states that, for large sample size n, the distribution of
the sample mean for an independent random sample of any distribution is
approximately normal with mean μ and standard deviation σ/√n. Practically, for
most distributions, large means n ≥ 30. [See CLT]
Thus, for large n, the statistic ∑x/n, and therefore ∑x, is normal. This implies that
for large n the distribution of the number of successes divided by the number of
trials, which is the sample proportion of successes p̂, for a binomial experiment is
normal with mean μ = p and standard deviation σ/√n = √(p(1 – p))/√n =
√(p(1 – p)/n). Large means that np ≥ 5 and n(1 – p) ≥ 5.
∑x is also normal; its mean is just n times p, or np, and its standard deviation is
n times that of p̂, or √(np(1 – p)).
Applications: Toss a fair coin 100 times. np = n(1 – p) = 50; μ = np = 50;
σ = √(np(1 – p)) = √(100·.5·.5) = 5. From the CLT the observed number of successes
is approximately normal with mean = 50 and σ = 5. Use the normal approximation to find the
following:
i) P(X = 50) [Hint: 50 lies between 49.5 and 50.5]
ii) P(X > 55) [Again use the continuity adjustment of .5 units]
iii) P(45 ≤ X ≤ 58)
Now p = .5, so p̂ is approximately normal with mean μ = .5 and σ = .05; therefore find the
probability the sample proportion is
iv) more than 60 percent
v) less than 45 percent
vi) between 45 and 50 percent
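The notes do not state the numerical answers for i)–vi); the sketch below computes them, applying the continuity adjustment where the hints call for it:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 100, 0.5
mu = n * p                            # 50
sigma = math.sqrt(n * p * (1 - p))    # 5

# i) P(X = 50) with the continuity adjustment: P(49.5 < X < 50.5)
p_i = phi((50.5 - mu) / sigma) - phi((49.5 - mu) / sigma)
# ii) P(X > 55) = P(X >= 56) -> P(X > 55.5)
p_ii = 1 - phi((55.5 - mu) / sigma)
# iii) P(45 <= X <= 58) -> P(44.5 < X < 58.5)
p_iii = phi((58.5 - mu) / sigma) - phi((44.5 - mu) / sigma)

# iv)-vi) for the sample proportion, p-hat ~ N(.5, .05)
p_iv = 1 - phi((.60 - .5) / .05)             # more than 60 percent
p_v = phi((.45 - .5) / .05)                  # less than 45 percent
p_vi = phi((.50 - .5) / .05) - phi((.45 - .5) / .05)   # between 45 and 50 percent
```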
II. For a binomial random variable with large n and small p, and np < 5.
A better approximation is often obtained using the Poisson with λ=µ=np. [Proof]
Example: Each day an inspector selects 200 parts randomly. If one percent of the
parts are expected to be defective, find the probabilities:
i) none are defective [λ= 2, P(x = 0) = .135]
ii) exactly two are defective [Hint: P(x ≤ 2) – P(x ≤ 1)]
iii) more than six are defective [1 – P(x ≤ 6)]
iv) between two and five, inclusive, are defective [You do it]
III. Applying the CLT to a large random sample
Each month the Bureau of the Census conducts a random sample of 50,000 U.S.
households for the Labor Department. Let’s suppose the survey includes two
questions:
b. “What do you estimate your household’s income was last month?”
c. “How many members of your household were unemployed last month?”
[In fact, the questions are a bit more sophisticated than these, but let’s suppose
this is the tenor of the questions]. Now suppose the survey found the following:
The mean for salaries was $4100, with a standard deviation s of $1050, and
of the 55,600 household members determined to be in the labor force, 5,060
were unemployed.
From sampling theory, we know that x̄ for the salary data is normally distributed
with mean μ and standard deviation σ/√n. While we don’t know the true σ,
due to the size of the sample, s ≈ σ and σ/√n ≈ 1050/√50,000 = $4.70. That is, the
mean of the population, whatever it is, differs from the $4100 we got from the
survey by about $4.70 on average. That means our survey value is really quite accurate.
We are now ready to construct what we call a confidence interval for μ. We know
that the true average monthly salary is not the sample mean of 4100. But we also
know it is quite close. Now from the normal distribution, we can find what fraction
of the population lies within a given number of standard deviations from the mean.
In fact, we see from the Z table that .4750 of the population lies within 1.96 standard
deviations on each side of the mean.
Thus, an interval around the mean that extends 1.96σ/√n on each side will contain 95%
of all sample means drawn from samples with n = 50,000. But to say this means that
there’s less than a 5% chance that the true mean lies beyond 1.96($4.70) = $9.21 from the
observed mean of this particular sample. Constructing this interval yields the
following result: we are 95% confident that the true mean of the population,
whatever it is, lies inside 4100 ± 9.21 = (4090.79, 4109.21). Thus, the formula for a
95% confidence interval is x̄ ± 1.96σ/√n. In general, a confidence interval
when σ is known is given by x̄ ± z_(α/2)·(σ/√n), where α is the area in the two tails and
1 – α is the area under the curve within the confidence interval.
Similarly, for the true unemployment rate of the population we use sampling
theory to tell us that the standard deviation of the sample proportion – the expected
difference between the true population proportion and the proportion obtained
from a sample of size n – is √(p(1 – p)/n). Of course we don’t know the actual
proportion p of the population that is unemployed. But again, since the sample size
is very large, 55,600, we can assume that the observed sample proportion is quite
close to the true proportion.
From our sample, p̂ = 5060/55,600 = .091. So √(p̂(1 – p̂)/n) =
√(.091(.909)/55,600) = .0012, or a little over one-tenth of one percent. Again, to
construct a 95% confidence interval for p, we get .091 ± 1.96(.0012) = .091 ± .0024 =
(.0886, .0934). In fact, this is quite a wide interval for a statistic as important as this
month’s unemployment rate. That explains why one needs to be very careful in
interpreting any one month’s estimate as being important. It’s only by watching
this statistic over time that we can be sure what is happening in the labor market.
On average this statistic is off by about .12 percentage points. Often the researcher
will say something like “last month’s unemployment was 9.1% ± .24%”. Generally,
surveys give the error associated with a 95% confidence interval. They will simplify
this by using 2σ rather than 1.96σ.
Example: In a hotly contested election, candidate A’s statistician surveys 1000
potential voters and finds that 512 of the respondents say that if the election were
held today, they would vote for A’s opponent. The remaining 488 voters say they
are going with A. Construct a 95% confidence interval for A.
We used the 95% confidence interval. But we could just as easily have asked for
a 90% or a 99% confidence interval. For the 95% interval the z-score used was
1.96; find the appropriate z-score for the 90% and the 99% intervals. [1.645 and
2.576]
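A sketch that constructs candidate A's 95% interval and recovers the z-scores by inverting the normal CDF numerically (bisection is used only because the standard library has no built-in inverse CDF):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_for_confidence(conf, lo=0.0, hi=10.0):
    """Find z with P(-z < Z < z) = conf by bisection on the CDF."""
    target = 0.5 + conf / 2   # upper cutoff as a CDF value
    for _ in range(80):
        mid = (lo + hi) / 2
        if phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Candidate A's 95% confidence interval: p-hat = 488/1000.
phat, n = 0.488, 1000
se = math.sqrt(phat * (1 - phat) / n)
z95 = z_for_confidence(0.95)
interval = (phat - z95 * se, phat + z95 * se)

assert abs(z95 - 1.96) < 0.005
assert abs(z_for_confidence(0.90) - 1.645) < 0.005
assert abs(z_for_confidence(0.99) - 2.576) < 0.005
```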
IV. Small sample confidence interval for the population mean
In the previous examples we constructed our confidence interval based on two
facts about large samples: by the CLT, is normal and s is a close approximation
of σ. But often we do not have the luxury of a large sample. So what do we do in
those instances in which we have only a small sample to go on?
Well, we saw from the CLT graphs that in the case of one particular distribution,
the normal, the sample mean for any size n is normally distributed. So, if the parent
distribution from which our sample is drawn is normal, or very close to normal, we
can assume that even for very small samples the sample mean is distributed
normally.
The second issue about how close the sample standard deviation is to the
population standard deviation still remains. Fortunately there is a way to adjust the
Z to produce confidence intervals. The individual who first determined the
corrected intervals was William Gosset, who worked for Guinness Breweries in
Ireland and published under the pseudonym Student.
his name is called the Student t distribution.
Now, anytime there is additional error created, if that error is balanced between
being too large and being too small, the implication is that the interval will have to
be wider. And since E(s²) = σ², the error is balanced. Therefore, the t exceeds the
corresponding z-score.
Now the reason we need to adjust the width of the interval is that s and σ differ.
The magnitude of that difference is based on the size of the sample; for very large
samples the error is very small and we can use the Z. That is, for large n, t ≈ Z. But
for small n the difference between s and σ will be quite large on average and t will
be considerably larger than Z. The measure that determines the amount of
adjustment needed is referred to as the degrees of freedom. In the case in which we
estimate the sample variance using the sample mean, we divide the sum of squares
by n – 1, so n – 1 is the degrees of freedom and we use t_(α/2) with n – 1 degrees of
freedom. In this case α is the area in the two tails and 1 – α is the area between the tails.
To understand how to construct a small-sample interval, let’s do an example.
Suppose we obtained a random sample of 10 IQ scores from children in a particular
town and the mean was 108 and the standard deviation s was 16.5. Now we know
that IQ scores are normally-distributed nationally and therefore will be very close to
normal within the town. Then the 90% confidence interval will be
x̄ ± t(s/√n) = 108 ± t(16.5/√10).
To find the t, go to the t-table, .05 column, and ninth row. [See the t table] We see
that t = 1.833, so the interval is 108 ± 1.833·(16.5/√10) = 108 ± 1.833·5.22 = 108 ±
9.56 = (98.44, 117.56). Thus, we are 90% confident that the mean IQ of children in
this town lies within this interval.
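The interval arithmetic can be checked in a few lines, taking t = 1.833 from the table as given:

```python
import math

# Sample summary from the example above.
n, xbar, s = 10, 108, 16.5
t_05_9 = 1.833   # t table: .05 column, 9 degrees of freedom

half_width = t_05_9 * s / math.sqrt(n)
interval = (xbar - half_width, xbar + half_width)

assert abs(half_width - 9.56) < 0.01
assert abs(interval[0] - 98.44) < 0.01 and abs(interval[1] - 117.56) < 0.01
```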
V. The hypothesis test of the population mean and proportion.
Another way to address the issue posed in the analysis above is to ask whether or
not some a priori assumption about the population’s mean or proportion is
supported by the data gathered. In the example just above, we might ask whether
or not the mean IQ for children in the town in which the data were collected exceeds
the national average of 100. The observed mean IQ suggests this might be the case;
however, it may just be that the sampling error was large.
We would then set up the following hypothesis test:
H0: μ ≤ 100
Ha: μ > 100
α = .05
This particular test is a one-tailed, greater than test. H0 is called the null
hypothesis, and is what is not covered by the alternative hypothesis Ha. Ha states the
hypothesis we are trying to prove and does not contain equality in it; the three
possibilities for Ha are μ > μ0, μ < μ0 and μ ≠ μ0. H0 always has equality: the
corresponding H0 are μ ≤ μ0, μ ≥ μ0 and μ = μ0.
No matter how large a sample we obtain, we can never be 100% sure, if we reject
the null hypothesis in favor of the alternative, that we are correct. The α, or
significance level, is the probability that we incorrectly reject the null hypothesis. In
other words, we can tolerate falsely rejecting a true null hypothesis α of the time. In
the IQ example above we are trying to prove that the average IQ in our town is
above that of the population as a whole.
Suppose we are willing to accept a false rejection level of 5%. So only if our
observed sample mean is too far above the hypothesized mean of 100 can we be 95%
confident that we are not falsely rejecting it.
Now the z-score that we would need, Z = (x̄ – μ0)/(σ/√n), must be large enough
that the probability of exceeding it by chance is less than 5%. We call this z-score
the critical z; in this case, had the sample been very large, Zc = 1.645 (see the Z
table for .4500). But in our example we don’t know σ, so we revise the
formula to t = (x̄ – μ0)/(s/√n), and tc = t.05, 9 = 1.833. If the observed t exceeds
1.833, we reject H0 and conclude the town’s mean IQ exceeds 100. We get (108 –
100)/(16.5/√10) = 1.533 < 1.833, so we fail to reject H0. That is not to say we accept
H0; our best guess still is that the mean exceeds 100. After all, 108 is considerably
larger than 100, but the sample was just too small to reach that conclusion at the
5% significance level.
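The same one-sample t computation can be sketched in a few lines (Python assumed; the critical t is again read from the table):

```python
import math

n, xbar, s, mu0 = 10, 108, 16.5, 100
t_obs = (xbar - mu0) / (s / math.sqrt(n))  # (108 - 100)/5.22 ~ 1.533
t_crit = 1.833                             # t(.05, 9 df), from the t table

# One-tailed, greater-than test: reject H0 only if t_obs > t_crit
print(round(t_obs, 3), t_obs > t_crit)     # 1.533 False -> fail to reject H0
```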
Let’s now consider the election example where candidate A has only p̂ = 488/1000
= .488 of the sample in her favor. We are trying to decide whether or not to throw in
the towel. Suppose we conclude that if she has less than a 5% chance of winning the
election, we will stop throwing good money after bad. Then our test will be
conducted as follows:

H0: p ≥ .50
Ha: p < .50
α = .05

If we reject H0, we quit. Now zobs = (p̂ – p0)/√(p0(1 – p0)/n) = (.488 –
.5)/√(.5 × .5/1000) = –.012/0.01581 = –.76. To reject H0 we needed
zobs < –1.645, so A should carry on the fight. Are you getting the hang of it yet?
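A quick check of the z computation (a sketch; Python assumed):

```python
import math

n, x = 1000, 488
p_hat, p0 = x / n, 0.50
se = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
z_obs = (p_hat - p0) / se               # -.012/.01581 ~ -.76
z_crit = -1.645                         # one-tailed, less-than, alpha = .05

print(round(z_obs, 2), z_obs < z_crit)  # -0.76 False -> fail to reject; keep campaigning
```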
VI. Two population tests of hypotheses.
So far we have limited our analyses to one population, one variable. In this
section we discuss comparisons of means and proportions on one variable across
two populations. We begin with a general discussion of distribution theory.
Suppose two random variables are normally distributed, Xi ~ N(μi, σi), i = 1, 2,
and we draw independent samples of sizes n1 and n2. Then
x̄1 – x̄2 ~ N(μ1 – μ2, √(σ1²/n1 + σ2²/n2)) and

[(x̄1 – x̄2) – (μ1 – μ2)]/√(σ1²/n1 + σ2²/n2) ~ N(0, 1) is a Z score.
From the sampling distribution of the sample proportion, it is left as an exercise
to show p̂1 – p̂2 ~ N(p1 – p2, √(p1(1 – p1)/n1 + p2(1 – p2)/n2)). Assuming we are
testing for p1 – p2 = 0,

Zobs = (p̂1 – p̂2)/√(p̄(1 – p̄)(1/n1 + 1/n2)), where p̄ = (n1p̂1 + n2p̂2)/(n1 + n2).

Because we don’t know what the values are for p1 and p2 and we are testing for the
equality of the two proportions, we use the best guess we have for the common p;
that is the p̄ given above.

[As an exercise, show that p̄ is both a weighted average of the two observed p̂’s,
where the weights are proportional to the sample sizes and sum to one, and is the
same as pooling the observed number of successes from the two samples divided by
the total number of observations in the samples.]
- Two-tailed test of differences in proportions:
H0: p1 = p2
Ha: p1 ≠ p2
α = .05
Example: Test for a difference in proportions of two stores selling the store brand of
a product if we obtain the following data in random sample of the two stores.
Number of
total units sold
Number of store
brand units sold
Store 1 1000 350
Store 2 1500 600
From these data, p̂1 = 350/1000 = .35, p̂2 = 600/1500 = .40, and p̄ = (350 + 600)/(1000 + 1500)
= 950/2500 = .38. Now rather than calculate the observed Z by hand, we use the Statistics
Template to do the analysis. Go to the Diff proportion sheet and input the values.
You should get the following output.

Large-Sample Test for difference between 2 Population Proportions
                            1        2
Sample size n =          1000     1500
No. Successes (x) =       350      600
p-hats =                 0.35     0.40
p-hat (pooled) = (x1 + x2)/(n1 + n2) = 950/2500 = 0.38
Z TS = (p1 - p2)/√[p(1-p)(1/n1 + 1/n2)] = -0.05/0.019816 = -2.523
alpha = 5.00%          alpha/2 = 2.50%
p-value (1 tail) = 0.58%     p-value (2 tail) = 1.16%
z-critical (1-tail) = 1.645  z-critical (2-tail) = 1.960

Since |Zobs| = 2.523 exceeds Zcrit = 1.96, we reject H0 and conclude the proportions
differ at the two stores.

- One-tailed test of differences of means, large samples:

Suppose we wish to test, at the 10% significance level, whether mean daily receipts
at Store 2 exceed those at Store 1, given the following random samples:

         Number of days   Average receipts   Standard deviation
Store 1        49              $4200               $2100
Store 2        36              $5000               $2400
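The template’s numbers for the two-store proportion test can be verified directly (a sketch in Python rather than the template):

```python
import math

n1, x1 = 1000, 350
n2, x2 = 1500, 600
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                         # 950/2500 = 0.38
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))  # ~ 0.019816
z_obs = (p1 - p2) / se

print(round(z_obs, 3), abs(z_obs) > 1.96)              # -2.523 True -> reject H0
```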
H0: μ2 ≤ μ1
Ha: μ2 > μ1
α = .10
Now since for large samples (n > 30), s ≈ σ, Zobs = (4200 – 5000)/√(s1²/n1 + s2²/n2)
= –800/√(2100²/49 + 2400²/36) = –800/500 = –1.60. Rather
than do the calculations, again we may use the template.
Two sample z-test Template
n1 = 49 n2 = 36
x-bar1 = 4200 x-bar2 = 5000
s1 = 2100 s2 = 2400
tobs = -1.600
p-value(1) = 0.055
p-value(2) = 0.110
alpha = 10.00%
t-crit (1-tail) = 1.282
t-crit (2-tail) = 1.645
Since the observed z-score of -1.60 is less than the critical z of -1.282 (larger in absolute
value and negative, as it must be), we reject H0 and conclude, at the 10% significance level,
that μ2 > μ1.
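A direct check of the z-statistic (a sketch; Python assumed):

```python
import math

n1, xbar1, s1 = 49, 4200, 2100
n2, xbar2, s2 = 36, 5000, 2400

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # sqrt(90000 + 160000) = 500
z_obs = (xbar1 - xbar2) / se             # -800/500 = -1.6
z_crit = -1.282                          # one-tailed, alpha = .10

print(z_obs, z_obs < z_crit)             # -1.6 True -> reject H0; mu2 > mu1
```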
- Small sample differences in mean.
In comparing means taken from small samples, we again encounter the same problems
we had with the test of the mean: the issue of normality and not knowing the true
variances. The first is easily dealt with if we can assume both populations are
approximately normally distributed. If we were comparing exam scores on standardized
tests, for example, this assumption makes sense.
The second issue is more nettlesome. To see why, look at the test statistic from the large-
samples case, Zobs = (x̄1 – x̄2)/√(s1²/n1 + s2²/n2). In that case we could rely on the large sample
sizes to claim that the sample variances were approximately the same as the population
variances and therefore the Z is distributed approximately normal.
But with small samples the distribution of the test statistic is t-distributed. But with the t
we need to know the degrees of freedom. In this case, however, we have two sources of
error, from each of the two samples, with (n1 – 1) and (n2 – 1) degrees of freedom,
respectively. While both create greater uncertainty, the smaller sample of the two is of
greater concern since that requires greater adjustment from the Z. There are two ways to
deal with the problem; but in the case that we feel the two population variances are about
the same, as would be the case in comparing standardized test scores, we make that
assumption (the test for equality of variances will be shown in the next section) and
estimate the common variance as follows:
sp² = [(n1–1)s1² + (n2–1)s2²]/[(n1–1) + (n2–1)] = [(n1–1)s1² + (n2–1)s2²]/(n1 + n2 – 2)

We then use this value in place of the two variances in the formula; this statistic is
distributed as a t with n1 + n2 – 2 degrees of freedom. To see why we use this formula to
estimate the common variance, recall that s² = Σ(x – x̄)²/(n–1) for each sample, or
(n–1)s² = Σ(x – x̄)².
Thus, sp2 is a weighted average of the two variances with weights proportional to the
degrees of freedom and summing to one. But it is also the same as throwing all the sum of
squares into one basket and dividing by the sum of the two degrees of freedom. It makes
perfectly good sense to do it this way as we assumed each of the squared differences from
the respective population means is the same size. We lose two degrees of freedom from n1 +
n2 because we allowed each of the samples to select its own mean.
Suppose we wanted to test for the equality of two population means drawn from two
normally-distributed populations with equal variance. First we assume both populations
are normally distributed with equal variances. The sample data are given below.

          n     x̄     s
Pop. 1   16   125    22
Pop. 2   22   108    18

Using the template (2Sample t-test) we obtain the following results for a two-tailed, 10%
significance level test.

Two sample t-test Template for BUSN 5760
n1 = 16            n2 = 22
x-bar1 = 125       x-bar2 = 108
s1 = 22            s2 = 18
sp = 19.76529
tobs = 2.617733
p-value(1) = 0.006432
p-value(2) = 0.012865
alpha = 10.00%
t-crit (1-tail) = 1.305514
t-crit (2-tail) = 1.688297

Since the tobs of 2.62 exceeds the critical t of 1.69, we can reject H0 and conclude the
population means are not equal. Note that the degrees of freedom of this test is
n1 + n2 – 2 = 36, large enough to use the Z test; the z critical would have been 1.645,
only slightly less than the correct t of 1.688.
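The pooled-variance computation can be reproduced as follows (a sketch in Python):

```python
import math

n1, xbar1, s1 = 16, 125, 22
n2, xbar2, s2 = 22, 108, 18

# Pooled variance: weighted average of the two sample variances,
# with weights proportional to the degrees of freedom
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)
t_obs = (xbar1 - xbar2) / (sp * math.sqrt(1/n1 + 1/n2))
df = n1 + n2 - 2  # 36

print(round(sp, 3), round(t_obs, 3))  # 19.765 2.618
```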
In the next section, we discuss a test of equal variances. In conducting that test, we may
find that the assumption of equal variances is rejected, in which case an alternative
method must be used. For a discussion of what we may do in this case, read how we can
estimate the appropriate degrees of freedom for the t test without assuming equal
variances here. The small sample difference of means sheet in the Statistical Template
conducts the test of variances automatically. These topics, the F-test of equal variances
and calculating the approximate degrees of freedom, are not covered in the lecture videos.
VII. Test of equal variances
This test relies on the fact that if two populations have equal variances then the ratio of
the sample variances should be one. But since the observed variance for each sample is
not equal to the corresponding population variance, the observed ratio will not equal 1.
Now, take the ratio of the larger of the two to the smaller variance and compare the
quotient to the test statistic, the F statistic with nℓ-1, ns–1 degrees of freedom, where nℓ
represents the number of observations in the sample with the larger variance and ns is the
number of observations in the other sample.
The test is conducted as follows:
H0: σ1 = σ2
Ha: σ1 ≠ σ2
α = α0
In this case we want to fail to reject the null hypothesis in order to use the assumption
that the variances are the same. The assumption is that both populations are normally-
distributed, the same assumption we made to compare the means.
The F distribution has the following appearance:

[Figure: the F(21, 15) density, with the variance ratio on the horizontal axis and the
rejection area α in the right tail.]

The area to the right of the ratio of the two variances is the probability that you obtain a
given ratio if the population variances are actually equal. In our case, the ratio is (22/18)² =
1.493827; the probability of getting this large an observed F-statistic can be found in Excel
as follows: “=fdist((22/18)^2,21,15)”. The answer is 0.214944. The critical F value is found
in the F-table with the appropriate significance level. Since we used .10 for the test of the
means, we can use the same α for the F-test. The F-Tables hyperlinks are found at
http://faculty.citadel.edu/silver/f_tables.htm; now click on α = .1. While we cannot find 21
df in the numerator, we do see 20 and 25. The critical F value for (20, 15) is 1.92 and for
(25, 15) it is 1.89. Our observed F is well below either of these two values, so we fail to
reject the null hypothesis and can conclude the variances are about equal, which is what we
wanted to show.
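The variance-ratio check above can be sketched as follows (Python assumed; the table critical values are those quoted in the text):

```python
# F ratio for the equal-variance check in the pooled-t example above
s1, s2 = 22, 18                 # sample standard deviations
F_obs = (s1 / s2) ** 2          # larger variance over smaller, ~ 1.494

# Critical values read from the alpha = .10 F table, as in the text
F_crit_20_15, F_crit_25_15 = 1.92, 1.89

# Observed F is below both bracketing critical values -> fail to reject H0
print(round(F_obs, 4), F_obs < F_crit_25_15)  # 1.4938 True
```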
VIII. More than two populations comparisons of means and proportions; ANOVA and χ2
tests.
The analysis of variance technique, ANOVA, is used when we wish to make inferences
about the equality of means of three or more populations. As was the case for the two
population case, we assume all populations are normally-distributed and the variances are
equal. The hypotheses are:
H0: μi = μj, all i, j
Ha: μi ≠ μj , some i, j
α = α0
That is, either all the means are the same, or at least two means differ.
[Figure: two panels, each showing three normal curves for Pops 1, 2, and 3: one labeled
“Means approximately equal,” the other “Means clearly differ.”]

The two graphs above show what the distributions will look like under the two
hypotheses: the one on the left is what we would get if the means were approximately
equal; the one on the right represents the distribution of observed values if there are
significant differences in the population means.
If the first case exists, then the sums of squared differences calculated from the means of
the respective samples, the i, i = 1, 2, 3 in this case, will be insignificantly different, on
average, than if calculated from the overall sample mean T. On the other hand, if the
means of the three populations are not all the same, and at least one of them differs
significantly from at least one of the others, then calculating the sum of squares from the
overall mean rather than from the group means will result in a significantly larger sum. In
this latter case we can partition the errors into two parts, what we call the within groups
sum of squares, SSwg, and the between groups sum of squares, SSbg, and conduct the F-test
on the ratio of these two sums of squares after each sum of squares is divided by its degrees
of freedom. Thus, the F is the ratio of two variances as we did earlier.
To see how this is done, suppose we start by summing up the total sum of squares, SST =
Σ(x – x̄T)² over all i = 1, 2, … , N, where N = n1 + n2 + n3 is the total number of
observations in our sample. Alternatively, we could calculate this total sum group by
group: for the first group, write each deviation as (x – x̄1) + (x̄1 – x̄T), so the group’s
contribution is Σ[(x – x̄1) + (x̄1 – x̄T)]². Expanding, this sum equals

Σ(x – x̄1)² + 2(x̄1 – x̄T)Σ(x – x̄1) + n1(x̄1 – x̄T)².

The middle term, 2(x̄1 – x̄T)Σ(x – x̄1), equals zero, as Σ(x – x̄) ≡ 0 for the values of x
that produced the x̄.

The first term, Σ(x – x̄1)², is our within-group sum of squares for group 1, and the last
term, n1(x̄1 – x̄T)², is the between-group sum of squares from group 1. We do the
same for the other two groups, add up all of the within-group sums of squares and
all the between-group sums of squares, and take the ratio of the variances
(SSbg/dfbg)/(SSwg/dfwg) = MSbg/MSwg. The observed statistic is an F-statistic with the
degrees of freedom calculated as follows. [MS stands for mean square.]
The degrees of freedom or df for the sum of squares within group i is ni – 1, the df of the
SSwg = n1 – 1 + n2 – 1 + n3 – 1 = N – 3. [Were there ng groups, this sum would be N - ng.]
The df of the SST = N – 1, therefore the SSbg df is (N – 1) – (N – ng) = ng – 1 = 2 in this
example. So the df of the F-test are (ng - 1, N – ng). As an example of how this is done,
look at the following data of observations from three groups.
Group 1: 132, 132, 131, 138, 137, 132, 132
Group 2: 122, 123, 120, 120, 124, 128
Group 3: 126, 133, 128, 129, 127
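The ANOVA sums of squares can be reproduced from the raw data (a sketch in Python rather than the template):

```python
from statistics import mean

groups = [
    [132, 132, 131, 138, 137, 132, 132],
    [122, 123, 120, 120, 124, 128],
    [126, 133, 128, 129, 127],
]
N = sum(len(g) for g in groups)          # 18 observations in total
grand = sum(sum(g) for g in groups) / N  # overall mean x-barT ~ 128.556

# Within-groups SS: squared deviations from each group's own mean
ss_wg = sum(sum((x - mean(g))**2 for x in g) for g in groups)
# Between-groups SS: group sizes times squared deviations of group means
ss_bg = sum(len(g) * (mean(g) - grand)**2 for g in groups)

ms_wg = ss_wg / (N - len(groups))  # df = N - ng = 15
ms_bg = ss_bg / (len(groups) - 1)  # df = ng - 1 = 2
F_obs = ms_bg / ms_wg

print(round(ss_wg, 2), round(ss_bg, 2), round(F_obs, 2))  # 121.75 362.7 22.34
```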
If we use the Statistics Template, we get the following results from the ANOVA routine.

ANOVA Template for BUSN 5760
Number of Treatment Levels (up to 7) = 3

               1           2           3
n              7           6           5
x-bar       133.4286    122.8333    128.6
s             2.8200      2.9944      2.7019
SSwg i       47.7143     44.8333     29.2000
Total SSwg = 121.75
MSwg = 8.12    (standard error within groups = 2.8489)
x-barT = 128.5556
SSbg        166.2240    196.4630      0.0099
Total SSbg = 362.6968
MSbg = 181.3484
alpha = 1.00%
Fobs = 22.3432
Fc = 6.3588
p-value = 3.17E-05
Decision: Reject Ho

The actual calculations of the means and standard deviations can be performed on the
ANOVA sheet itself: the small 3×10 calculator pad to the right of column H may be used
to calculate the number of observations, the means of the samples, and their standard
deviations for samples of ten or fewer, with cell references then made in the spreadsheet.

We find that SSwg1 = (n1 – 1)s1² = 6(2.82²) = 47.7143. For all three groups, SSwg = 121.75
and MSwg = 121.75/(18 – 3) = 8.12. SSbg = Σni(x̄i – x̄T)² for i = 1, 2, 3 = 362.70, so
MSbg = 362.70/2 = 181.35. The observed F = 181.35/8.12 = 22.34 >> 6.36, the critical
value of F(2, 15, .01). Based on these results, we conclude, at the one percent significance
level, that the means differ.

B. The Chi-square (χ2) test of independence.

The equivalent test for equal proportions, the χ2 test, is generally stated as a test of
independence. Let’s suppose we have groups of individuals that we believe may differ on
the basis of a particular attribute: for example, Republicans versus Democrats on their
opinion of the health care bill, or young people versus old people on whether there should
be cuts in Social Security benefits. The null and alternative hypotheses are as follows:

H0: Attribute A is independent of Group
Ha: Attribute A is dependent on Group
α = α0

From probability theory, independence means that for two events P(A given B) =
P(A); for example, the probability of tossing heads on the second toss of a fair coin
equals the probability of tossing heads (.5) regardless of what happened on the first toss.
Thus, the result A is independent of any earlier outcome B. Let us consider three groups
of registered voters, Republicans, Democrats and Independents and we ask them their
opinion on the health care law. The table below lists the results from the survey.
Opinion       Democrat   Republican   Independent   Total
Favor            250          90           160        500
Against          100         160           240        500
No Opinion        50         100            50        200
Total            400         350           450       1200
Now, if party affiliation did not matter, we should have equal proportions of each party
in favor of, against, or neutral to the legislation. Now we know we will not get exactly the
same proportion because of sampling error, but they should be close enough that one
cannot statistically show a difference. So, for example, 1/3 of the sample (400 of 1200) are
Democrats, so 1/3 of those in favor should be Democrats, that is, 1/3 of 500 = 166.67. We
can do this for each cell of the table. The formula for the expected cell counts is
Eij = (Nrowi × Ncolj)/N. So for the 2,3 cell, Independents against the law, we get
500 × 450/1200 = 187.5.
Notice that once we know E11 and E12, E13 = Erow1 – (E11 + E12), which implies that we lose one
degree of freedom for that row. The same applies to rows 2 and 3. In addition, once I
know E11 and E21, E31 is known. Similarly for columns 2 and 3. In effect, we lose a degree of
freedom for the last row and the last column whenever we use observed cell counts to obtain row
and column totals. Therefore, df = (Nr – 1)(Nc – 1) = (3 – 1)(3 – 1) = 4 in this example. [The
shape of the χ2 4 degrees of freedom distribution is shown below.]
Assuming that all expected cell counts equal or exceed 5 (remember the normal
approximation to the binomial?), we calculate the statistic (Oij – Eij)²/Eij for each cell
and add these up for all nine cells. The resulting statistic is distributed as a
χ2 with four degrees of freedom. We can then compare the observed with the critical value
using the χ2 table. So for α = .05, the critical χ2 is 9.4877. The table below is the output
from the Statistics Template, Chi-square routine, using the data above.
Chi-square Template
Allows one to calculate the Chi-square statistic for an m×n contingency table.
no. of rows (b/w 2 and 6) = 3        alpha = 5.00%
no. of columns (b/w 2 and 5) = 3

Observed counts:
         col1    col2    col3    Total
Row1      250      90     160      500
Row2      100     160     240      500
Row3       50     100      50      200
Total     400     350     450     1200

Expected counts:
Row1   166.6667   145.8333   187.5     500
Row2   166.6667   145.8333   187.5     500
Row3    66.6667    58.3333    75.0     200
Total  400        350        450      1200

(O – E)²/E:
Row1   41.6667   21.3762    4.0333    67.0762
Row2   26.6667    1.3762   14.7000    42.7429
Row3    4.1667   29.7619    8.3333    42.2619
Total  72.5      52.5143   27.0667   152.081

Chi-sq observed = 152.0810
Chi-sq(crit) = 9.4877
p-value = 7.29E-32
Decision: Reject Ho

Our observed value of 152 is much greater than the critical value of 9.49, so we reject
the null hypothesis in favor of the alternative that opinion on this legislation depends on
party affiliation. Thus, at the 5% significance level we conclude that voters’ opinion about
the health care legislation depends on their party registration.
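The χ2 computation can be reproduced from the observed table (a sketch in Python):

```python
observed = [
    [250,  90, 160],
    [100, 160, 240],
    [ 50, 100,  50],
]
row_totals = [sum(r) for r in observed]        # 500, 500, 200
col_totals = [sum(c) for c in zip(*observed)]  # 400, 350, 450
N = sum(row_totals)                            # 1200

# Expected count for each cell is (row total * column total)/N;
# sum (O - E)^2/E over all nine cells
chi_sq = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / N) ** 2
    / (row_totals[i] * col_totals[j] / N)
    for i in range(3) for j in range(3)
)
df = (3 - 1) * (3 - 1)  # 4 degrees of freedom

print(round(chi_sq, 2), chi_sq > 9.4877)  # 152.08 True -> reject independence
```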
Statistics Lecture Notes – Part 3 Multivariate models and analysis
Up to this point all of our analyses have been based on single-variable models.
At times we compared statistics across populations, but there was always a single
variable involved. In the following analyses, we will be trying to determine whether
or not a relationship exists among variables.
I. Correlation and causality: the simple ordinary least squares regression model.
We assume:
a. the model is linear; that is, Yt = a + bxt + et
b. the error terms et are normally distributed with mean 0 and constant variance σ²,
and this holds for all values of x.
c. the error terms et are uncorrelated with the x variable and with each other.
Now we take a random sample of (x, y) observations and use these to estimate the two
coefficients in the model, a and b. The estimated values, â and b̂, will not equal the
true values; but, depending upon how good a fit we get and how large a sample size we use,
the estimates will serve as better or worse indicators of the true values, or parameters.
To understand the rationale and theory behind the process of estimating the parameters
of the model, look at the Cartesian plane below, with axes drawn through the sample
means (x̄, ȳ). Suppose our (x, y) points fall as depicted in the diagram. We see that most
of the points are in quadrants I and III, so that, with the exception of three of the points,
the x values and y values either both exceed or both fall below their respective sample
means. Thus, when x > x̄ we also have y > ȳ, and the products (xi – x̄)(yi – ȳ) are positive
(with three exceptions, but even these are relatively small values). The sum of these
cross-products divided by n – 1 is called the sample covariance between the two variables.
In this case it is positive.

[Figure: scatter of (x, y) points with axes drawn through x̄ and ȳ.]

Had the points lain mostly in quadrants II and IV, then when x > x̄ we would have y < ȳ,
and vice versa, and the covariance would have been negative. So, the direction of the
relationship is determined by the sign of the covariance.
Note that the covariance of a variable x with itself is then Σ(x – x̄)(x – x̄)/(n–1) = s². Now
the size of the covariance is meaningless, as one can enlarge it or reduce it by merely
changing the units of measurement of either or both variables. The way we obtain the
slope is to scale the covariance by the variance of the variable on the horizontal axis; thus
b̂ = cov(x, y)/var(x). Since both of these are calculated by dividing by n – 1, we can also
write it as b̂ = Σ(x – x̄)(y – ȳ)/Σ(x – x̄)².

The corresponding intercept is given as â = ȳ – b̂x̄, because it turns out that the
estimated line of best fit, using the OLS estimation technique to minimize the sum of
squared errors, always passes through the point (x̄, ȳ) [see the proofs for this proof and
for more information on OLS regression], where the estimator for b̂ is the same as
given above: b̂ = Σ(x – x̄)(y – ȳ)/Σ(x – x̄)².
Then the line of best fit is Ŷ = â + b̂X, the estimated error for each observation is ê = Y –
Ŷ, and the sum of squared errors SSe = Σê². This value is the minimum over all choices of â
and b̂. The Statistics Template can be used to calculate the line of best fit. Let us take the
following example. We wish to determine how much training improves performance. We
measure training by the number of hours our employees are given in regular training
sessions and the performance by the number of errors made. A random sample of ten
employees revealed the following results.
Employee Hours trained Daily errors made
1 12 5
2 10 6
3 15 4
4 14 5
5 8 10
6 20 3
7 14 5
8 15 6
9 12 6
10 10 5
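The slope and intercept can be computed from the covariance formulas above (a sketch in Python):

```python
from statistics import mean

hours  = [12, 10, 15, 14, 8, 20, 14, 15, 12, 10]
errors = [ 5,  6,  4,  5, 10,  3,  5,  6,  6,  5]

xbar, ybar = mean(hours), mean(errors)  # 13, 5.5
# Sum of cross-products and sum of squares (the n - 1 divisors cancel)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(hours, errors))  # -43
sxx = sum((x - xbar) ** 2 for x in hours)                          # 104

b_hat = sxy / sxx            # slope ~ -0.4135
a_hat = ybar - b_hat * xbar  # intercept = 10.875

print(round(b_hat, 4), round(a_hat, 3))  # -0.4135 10.875
```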
From the table below we find that the OLS estimated line is given by Ŷ = â + b̂X
= 10.875 – 0.4135X; thus, we predict that a new employee makes about 11 daily errors
and each hour of training reduces the number of errors by .4135 per day.
We now need some sort of statistical test to see if the training has a statistically
significant effect on the number of errors made each day by the employees. To do this we
need first to ask the following questions: “If we did not use our model and the estimate
from OLS, what would we do instead?” Since we were minimizing the sum of squares,
we would want to compare our model’s performance with another technique that used
the same criterion. Also, we would presumably not have another variable to help make
our predictions; in effect, we would assume that each employee’s performance did not
depend on the number of hours of training, but was the same for all employees. In this
case, the number of errors was just random and unpredictable around the mean. Since
we don’t know µY, we use our best guess Ȳ and calculate s² = Σ(Y – Ȳ)²/(n–1). The sum
Σ(Y – Ȳ)² is SST, or total sum of squares.

Now we know what value minimizes SST for a set of data; it’s the sample mean. So we
have two ways of predicting Y: one uses the x values, the other does not. But our
regression model could have chosen to make b̂ = 0; then minimizing squares would have
made Ŷ = Ȳ. In that case SSe = SST and the improvement from using the x-variable in
reducing the squared errors would be zero.
At the other extreme, if all the points lay on a sloped line, the reduction in errors would
equal the SST and the SSe would be zero. In this case we would have explained 100% of the
variability in Y. Now the measure we use answers the question: “What fraction of the
variance in Y is explained by the one variable we used in our regression?” It is called the
coefficient of determination, or more commonly R2 and is calculated as follows. [Look at
the printout above]
Simple regression Template for BUSN 5760
Number of observations (up to 20) = 10

Obs    X    Y   X-Xbar  Y-Ybar     xy    x^2    Y-hat     e-hat   e-hat^2     y^2
  1   12    5     -1     -0.5     0.5      1   5.9135   -0.9135    0.8344    0.25
  2   10    6     -3      0.5    -1.5      9   6.7404   -0.7404    0.5482    0.25
  3   15    4      2     -1.5    -3.0      4   4.6731   -0.6731    0.4530    2.25
  4   14    5      1     -0.5    -0.5      1   5.0865   -0.0865    0.0075    0.25
  5    8   10     -5      4.5   -22.5     25   7.5673    2.4327    5.9180   20.25
  6   20    3      7     -2.5   -17.5     49   2.6058    0.3942    0.1554    6.25
  7   14    5      1     -0.5    -0.5      1   5.0865   -0.0865    0.0075    0.25
  8   15    6      2      0.5     1.0      4   4.6731    1.3269    1.7607    0.25
  9   12    6     -1      0.5    -0.5      1   5.9135    0.0865    0.0075    0.25
 10   10    5     -3     -0.5     1.5      9   6.7404   -1.7404    3.0289    0.25
Mean  13  5.5   Sums:     0      -43    104      55      0.0000   12.7212   30.50
                                                                  (SSE)      (SST)

b-hat = -0.413462      R^2 = 0.582913
a-hat = 10.875         r = -0.763487
Y-hat = 10.875 - 0.413462 X
se = 1.261009          F(obs) = 11.18065
sb-hat = 0.123652      F(prob) = 0.010174
t(obs) = -3.343748
alpha = 2.50%          tc (2-tail) = 2.751531    tc (1-tail) = 2.306006
t-prob (2-tail) = 0.010174    t-prob (1-tail) = 0.005087

[Chart: the OLS regression line plotted through the scatter of (X, Y) points.]
The improvement is given by SSR = SST – SSE (we changed the notation so that SSE =
SSe, etc.): from SST = Σ(Y – Ȳ)² subtract SSE = Σ(Y – Ŷ)², where the Ŷ are the values
predicted from the OLS model. We can now partition the SST into the SSE, which is still
unexplained, and SSR, which is the explained portion of SST. The R² = SSR/SST. In the
regression above, R² = (30.5 – 12.72)/30.5 = .583. Now the square root of this statistic,
bearing the sign of the slope (in this case negative), is called the sample correlation
coefficient rxy. So robs = –√.583 = –.7635. We can now test for the significance of this
statistic; it is distributed as a t-statistic with n – 2 degrees of freedom.
Another way to test for the usefulness of the model is to test the R2 directly. This test is
based on our partition of SST into explained and unexplained variances. The unexplained
error SSE is analogous to the SSwg in ANOVA and SSR is analogous to the SSbg. The
degrees of freedom of SST is n – 1, as we lose one degree of freedom because we use Ȳ
rather than μY in our sample variance. And we lose one more degree of freedom because
we use a variable x to estimate the variance in the regression model. Another way to think
of it is that we estimated two parameters, a and b, in the linear model using our data, so we
lose two degrees of freedom. The best way to understand the concept, however, is to realize
that with a straight line we can always get a perfect fit for any two points as two points
determine a straight line. Thus, we have no degrees of freedom permitting us to be off the
line.
So the SSR/1 is the average amount of variation explained by the one variable, x, in the
linear model. We lose two degrees of freedom, leaving us with n – 2 degrees of freedom for
the SSE. Then SSR/[SSE/(n–2)] ~ F1,n–2. As it turns out, an F1,k is just the square of a t
with k degrees of freedom. So the t must be the square root of the F, except that the t also
bears a sign, in this case the sign of b̂. We can also calculate the F as follows:
Fobs = R²/[(1 – R²)/(n–2)]. (Why?) Now the t-test, which is often more useful than the
corresponding F-test, since the F cannot test for the direction of the relationship, is
conducted as follows.
H0: ρ ≥ 0
Ha: ρ < 0
α = α0
where ρ is the population correlation coefficient. An equivalent test, which we do not
discuss here, is the test on the slope coefficient b directly. These two tests are totally
equivalent in the simple OLS model as the only source of correlation is through the one
independent variable in our model; a non-zero slope means we have correlation, a positive
slope means positive correlation, and a negative slope means negative correlation. In our
example, we expect more training to reduce errors, as we conduct a one-tail, less-than test.
From the printout above, we see that the observed t-value is -3.344 and the one-tailed
critical t-value for α = .025 is (-)2.306. Thus, at the 2.5% significance level we reject H0 and
conclude that training time and errors are negatively correlated.
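The correlation t-test can be checked numerically (a sketch; Python assumed, with R² taken from the printout):

```python
import math

n = 10
r2 = 0.582913       # R^2 from the regression printout
r = -math.sqrt(r2)  # correlation carries the sign of the slope

# t-statistic for H0: rho >= 0 against Ha: rho < 0, with n - 2 df
t_obs = r * math.sqrt((n - 2) / (1 - r2))
t_crit = -2.306     # one-tailed, alpha = .025, 8 df

print(round(t_obs, 3), t_obs < t_crit)  # -3.344 True -> reject H0
```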
IV. The OLS multiple regression model.
Extending the simple OLS regression model to more than one explanatory variable is
straightforward. The model is as follows:
Y = b0 + b1X1 + b2X2 + … + bkXk + e, where

e ~ N(0, σe), e is uncorrelated with any of the explanatory variables Xi or with the other
e’s, and the error variance is the same for all values of the independent variables X.

OLS regression then estimates the bi, i = 0, 1, …, k, so as to minimize SSE = Σ(Y – Ŷ)²,
where Ŷ = b̂0 + b̂1X1 + b̂2X2 + … + b̂kXk is the vector of predicted Y’s. The R² is
calculated as before: SST = Σ(Y – Ȳ)² and SSR = SST – SSE is the explained variation.
Then Fobs = (R²/k)/[(1 – R²)/(n – (k+1))] ~ Fk, n–(k+1). The test for the overall usefulness
is an F-test and the hypotheses are:
H0: bi = 0 for all i = 1, 2, … , k
Ha: bi ≠ 0 for some i = 1, 2, … , k
α = α0
To reject H0, Fobs > Fcrit = F(k, n-(k+1), α0)
If we reject H0 we should next test for which variables are explanatory; this is a series of
t-tests of the k bi coefficients. Each is of the form
H0: bi = 0 H0: bi ≥ 0 H0: bi ≤ 0
Ha: bi ≠ 0 Ha: bi < 0 Ha: bi > 0
and α = α0
While there is no multiple regression procedure in the Statistics Template, it is very easy
to use Excel to perform multiple regressions. Make sure that the columns of explanatory
variables are contiguous and then use the Analysis ToolPak under the Tools
tab (if this is not already loaded onto your computer, use Add-Ins, also under Tools, to load
it). [For Excel 2007 and later, click the icon at the top left, go to the Excel
Options at the bottom, select Add-Ins on the left, click Go at the bottom, and select
Analysis ToolPak.] Then highlight the dependent Y variable and the X variables and let
‘er rip.
Example: Suppose we also have data on the ages of the ten employees that we trained in
our earlier example. The table below lists the ages of the employees along with the training
hours and the number of daily errors made after the training.
Errors TrainHrs Age
5 12 49
6 10 30
4 15 30
5 14 29
10 8 25
3 20 40
5 14 54
6 15 32
6 12 35
5 10 42
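These results can be reproduced outside Excel as well (a sketch assuming Python with numpy available):

```python
import numpy as np

errors = [5, 6, 4, 5, 10, 3, 5, 6, 6, 5]
train  = [12, 10, 15, 14, 8, 20, 14, 15, 12, 10]
age    = [49, 30, 30, 29, 25, 40, 54, 32, 35, 42]

# Design matrix with an intercept column; least squares minimizes ||Xb - y||^2
X = np.column_stack([np.ones(len(errors)), train, age])
b, *_ = np.linalg.lstsq(X, np.array(errors, dtype=float), rcond=None)

# Coefficients: intercept ~ 12.589, TrainHrs ~ -0.378, Age ~ -0.059
print([round(c, 3) for c in b])
```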
We regressed the first column on the second and third and got the results shown below.
The two independent variables were able to explain about 2/3 of the variation in Errors.
The observed F value was 7.148 and the degrees of freedom were 2 in the numerator (the
number of explanatory variables) and 7 in the denominator (10 – (2 + 1)). From the F-
table with α = .05, the critical F is 4.737; that is, if the true slopes for all the variables were
zero, we would obtain an F value exceeding 4.737 less than 5% of the time. Therefore, we
reject the null hypothesis and conclude the model is useful.
As for which of the variables have an effect on Errors, we see that the coefficient for
TrainHrs remains about the same as before, -.378, and the t = -3.146 is still rather large (in
absolute value). The critical t-value is –t(7, .05) = –1.895 (using a less-than test at a .05
significance level). Thus, we conclude, at the 5% significance level, that the number of
hours of training had a negative impact on the number of daily errors made.
As for Age, the data suggest that older employees make fewer errors, but the observed t-
statistic of -1.372 is not significant at the 5% significance level (assuming a less-than test),
so we cannot conclude that age matters much. Thus, at the 5% significance level, we
cannot conclude that the age of the employee affected the number of errors made each
day.
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.819327
R Square              0.671297
Adjusted R Square     0.577382
Standard Error        1.196747
Observations         10

ANOVA
                df        SS        MS         F   Significance F
Regression       2  20.47457  10.23728  7.147921         0.020362
Residual         7  10.02543  1.432204
Total            9  30.5

           Coefficients  Standard Error     t Stat   P-value  Lower 95%  Upper 95%
Intercept      12.58934        2.00798   6.269653  0.000416   7.841223   17.33745
TrainHrs      -0.378037        0.120158  -3.146165  0.016239  -0.662165  -0.093909
Age           -0.059422        0.043313  -1.37194   0.212432  -0.161841   0.042996
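For readers who want to check the Excel output above by hand (or without Excel), the sketch below recomputes the same regression using Python with NumPy. This is not part of the Statistics Template; it is just one way to verify the coefficients, R², F, and t statistics from the data table.

```python
import numpy as np

# Data from the training example: daily errors, hours of training, and age
errors = np.array([5, 6, 4, 5, 10, 3, 5, 6, 6, 5], dtype=float)
train  = np.array([12, 10, 15, 14, 8, 20, 14, 15, 12, 10], dtype=float)
age    = np.array([49, 30, 30, 29, 25, 40, 54, 32, 35, 42], dtype=float)

n, k = len(errors), 2
X = np.column_stack([np.ones(n), train, age])  # design matrix with intercept column

# OLS: choose b to minimize SSE = sum((Y - Xb)^2)
b, *_ = np.linalg.lstsq(X, errors, rcond=None)

resid = errors - X @ b
sse = resid @ resid                            # unexplained variation
sst = np.sum((errors - errors.mean()) ** 2)    # total variation
r2 = 1 - sse / sst

# Overall-usefulness F statistic
f_obs = (r2 / k) / ((1 - r2) / (n - (k + 1)))

# Standard errors and t statistics for the individual coefficients
mse = sse / (n - (k + 1))                      # Excel's Residual MS
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
t = b / se

print(np.round(b, 3))    # intercept ~ 12.589, TrainHrs ~ -0.378, Age ~ -0.059
print(round(r2, 3))      # ~ 0.671
print(round(f_obs, 2))   # ~ 7.15
print(np.round(t, 3))    # TrainHrs t ~ -3.146, Age t ~ -1.372
```

The least-squares step could also be written out explicitly as b = (XᵀX)⁻¹XᵀY; `lstsq` is simply a numerically safer way of solving the same normal equations.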
Index of Terms Used in this Text
analysis of variance technique, ANOVA ................................................................................... 23
binomial distribution .................................................................................................................... 9
central limit theorem .................................................................................................................. 14
Chebychev ...................................................................................................................................... 4
Chi-square (χ2) test of independence ......................................................................................... 23
continuous distribution ................................................................................................................. 4
correlation .................................................................................................................................... 28
covariance .................................................................................................................................... 28
discrete distribution ...................................................................................................................... 8
distribution .................................................................................................................................... 4
expected value................................................................................................................................ 4
exponential distribution ............................................................................................................. 12
hypothesis test.............................................................................................................................. 15
inferential statistics ....................................................................................................................... 14
mean ............................................................................................................................................... 4
median ............................................................................................................................................ 4
mode ............................................................................................................................................... 4
multivariate models .................................................................................................................... 28
normal distribution ....................................................................................................................... 7
parameter....................................................................................................................................... 4
Poisson Distribution ...................................................................................................................... 8
probability density function (pdf)................................................................................................ 7
probability distribution ................................................................................................................ 4
qualitative data .............................................................................................................................. 4
random sampling .......................................................................................................................... 4
random variable ............................................................................................................................ 4
random variate .............................................................................................................................. 4
range ............................................................................................................................................... 4
significance level .......................................................................................................................... 15
simple ordinary least squares regression model ...................................................................... 28
standard deviation ........................................................................................................................ 4
uniform distribution ..................................................................................................................... 6
variance .......................................................................................................................................... 4
z-scores ........................................................................................................................................... 7