Statistics for the Business Majors: Lecture notes
by
Stephen Jay Silver, Ph.D.
Professor of Business Administration
The Citadel
This is a draft of a text that will be made available online in the very near future.
Please do not distribute this outside the confines of The Citadel.
2
Introduction and Preface
The content of this manual represents the essential elements of statistical theory and its
applications through problems and examples. What makes this manual unique is not so
much the content itself as the supplemental material made available to students using the
manual. These include a series of videos of studio lessons that I developed in the spring
and summer of 2008. Hyperlinks to these lecture videos appear at the end of this
preface/introduction.
In addition to what appears in this manual and the videos, underlying proofs (which are
not essential to the understanding of the material, but are presented parenthetically for the
“purists” in the class), problem sets and demonstrations of concepts are linked to the text
itself.
Also, over the years I have created an Excel Statistical Template that anyone can use to
reduce the amount of calculations, and thereby the chance for error, when applying
formulas to real-world problems. I have allowed smaller sections of my statistics classes to
use the template to work problems on exams in order to avoid computational errors and
shorten the time required for the tests.
The URL for this application is http://faculty.citadel.edu/silver/Statistics Template.xls.
When first queried whether to save or open the file, elect to save it and then open it.
Afterward you may save the file to your hard drive for later use.
The lecture notes themselves are quite condensed and do not make for light bedtime
reading; however, the videos should go a long way toward illuminating many of the points
made in the notes. So students who have difficulty understanding the material contained in
the notes are encouraged to view these videos, whose hyperlinks are shown on the next
page.
I should also mention that an alternative set of videos developed for my BADM 710 MBA
class, including topics other than statistics, is also available. I begin that course with
several lectures on optimization and the calculus. Hyperlinks to these lecture videos can be
found at my website at http://faculty.citadel.edu/silver. Click on BADM 710 and hyperlinks
can be found on that syllabus. In fact, the first statistics lecture video in the BADM 710
notes begins with my going over a problem that students may wish to ignore; they should
wait for the video to load fully and fast forward over this material.
Finally, a separate set of probability lecture notes may be included in the course; these are
found at http://faculty.citadel.edu/silver/ba604/prob_theory.pdf. Hyperlinks to the lecture
videos are given at the top of the notes as well as in the links just below this introduction.
Statistics Videos to accompany Professor Silver’s Statistics Lecture Notes
BADM 205 Business Statistics Online Video Lessons
Lesson 1 Part 1 Introduction to Statistics
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 01-part 1.wmv
Lesson 1 Part 2 Descriptive Statistics I
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 01-part 2.wmv
Lesson 2 Descriptive Statistics II Algebra of Summation
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 02.wmv
Lesson 3 Descriptive Statistics III Relative Location
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 03.wmv
Lesson 4 Probability I
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 04.wmv
Lesson 5 Probability II Conditional Probability, Independence, Bayes
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 05.wmv
Lesson 6 Discrete Probability Distributions I
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 06.wmv
Lesson 7 Discrete Probability Distributions II Binomial and approximations
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 07.wmv
Lesson 8 Continuous Random Variables
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 08.wmv
Lesson 9 The Central Limit Theorem and confidence intervals
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 09.wmv
Lesson 10 The Hypothesis Test of the population mean and proportion
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 10.wmv
Lesson 11 The Hypothesis Test Differences in means and proportions
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 11.wmv
Lesson 12 The Analysis of Variance Test for Differences in means of several populations
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 12.wmv
Lesson 13 The χ2 Test of Independence
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 13.wmv
Lesson 14 Simple OLS Regression
http://faculty.citadel.edu/silver/BA205/Online course/Lesson 14.wmv
Project 1 Descriptive Statistics
http://faculty.citadel.edu/silver/BA205/Online course/Project 01.wmv
Project 2 Demonstration of the Central Limit Theorem
http://faculty.citadel.edu/silver/BA205/Online course/Project 02.wmv
Project 3 Condo Case, Multiple OLS Regression model
http://faculty.citadel.edu/silver/BA205/Online course/Project 03.wmv
Statistics Lecture Notes – Part 1 Probability Distributions and Density Functions
I. Population, data and random variables
A population is selected and data are chosen from that population. The shape of the
population is called its distribution. A particular outcome of a drawing from a
random variable, or generator of values, is called a random variate. A random
variable does not refer to a single outcome of the sampling but to the process by
which outcomes occur.
Each distribution has one or more values, called the parameters of the
distribution. For example, the mean, median and the standard deviation are
parameters of a distribution.
Common distributions may be either continuous, such as the normal and
exponential, or discrete, such as the Poisson and the Binomial. [These will all be
covered later.]
Other distributions may not be of a standard form, but may be defined by a
listing of values and associated probabilities.
Often we do not know the distribution of the population and may draw values
from it randomly, called random sampling, to try to describe the distribution
generating the values and to estimate the distribution’s parameters.
A random sample is one in which every element within the population has an
equal chance of being selected.
II. Location and spread; two important measures of a distribution
The mean is the average value of the distribution; we denote the population mean
by the lowercase Greek letter mu, μ.
Other measures of central location are the mode and the median. The median is
the value that half the population exceeds and the other half is less than. The mode
is the most frequent value, in the case of a discrete distribution, or the highest point
on the frequency plot of a continuous distribution.
For some populations the median or mode is a better measure of location than the
mean. For example, the most common color or make of automobile – the mode –
makes more sense than the mean or median. In fact, with qualitative data such as
color, model, or gender, the other measures make no sense. And where outliers may
skew the data severely, such as income or housing prices, we usually give the median
value rather than the mean.
Measures of spread include the range or the distance between the lowest and
highest values of the distribution; the variance, or the average squared deviation from
the mean; and the standard deviation, which is the square root of the variance.
We write the mean as the expected value of the random variable X, or E(X). The
variance can therefore be written as σ² = E[(X – μ)²], and σ = √(E[(X – μ)²]).
For any population, Chebyshev proved that the maximum fraction of the
population, for any distribution, that can lie more than kσ away from the
population mean is 1/k² for k > 1; thus, at least 1 – 1/k² must lie within kσ of μ.
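Chebyshev's bound can be checked numerically; here is a minimal Python sketch (not part of the notes' Excel template; the data are invented purely for illustration):

```python
import statistics

# Hypothetical heavy-tailed data set, chosen to stress the bound.
data = [0] * 90 + [50] * 5 + [-50] * 5

mu = statistics.mean(data)        # mean of this data set
sigma = statistics.pstdev(data)   # population-style (divisor n) standard deviation

def frac_beyond(k):
    """Fraction of the data lying more than k*sigma from the mean."""
    return sum(abs(x - mu) > k * sigma for x in data) / len(data)

# Chebyshev: for any data set, the fraction beyond k*sigma is at most 1/k^2.
for k in (1.5, 2, 3):
    assert frac_beyond(k) <= 1 / k ** 2
```

The bound applies to the empirical distribution of any sample, which is why no distributional assumption is needed here.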
The Empirical Rule for mound-shaped populations relies on the areas under the
normal distribution curve, the so-called “bell curve”: about 2/3 will be within one σ,
about 95% within two σ, and almost the entire population within three σ from the
mean.
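The Empirical Rule can likewise be verified by simulation; a small Python sketch, assuming a normal population with μ = 100 and σ = 15 (the IQ parameters used later in these notes):

```python
import random

random.seed(1)
mu, sigma = 100, 15
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

def frac_within(k):
    """Fraction of the simulated population within k*sigma of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in sample) / len(sample)

# Empirical Rule: roughly 68%, 95%, and 99.7% within 1, 2, and 3 sigma.
assert 0.66 < frac_within(1) < 0.70
assert 0.94 < frac_within(2) < 0.96
assert frac_within(3) > 0.99
```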
III. Using data from random samples to estimate the mean and variance.
Let us draw a random sample from a population. This means that each element
in the population has an equal chance of being selected. Call these observations x1,
x2, …, xn. Then we estimate the mean and variance as follows.
i) x̄ = (x1 + x2 + … + xn)/n = x1/n + x2/n + … + xn/n. This can also be
written as x̄ = ∑x/n [we omit that we are adding up n x’s, as this is
understood. To refresh yourself on summation notation, go to the algebra
of summation page]
ii) s² = [(x1 – x̄)² + (x2 – x̄)² + … + (xn – x̄)²]/(n – 1)
= ∑(xi – x̄)²/(n – 1). So it is approximately the average of the squared
deviations from the mean.
The reason we divide by n – 1 instead of n is that the estimated result would be too
small on average, because we use our estimate of the population mean, x̄, instead of
the actual value μ, which is unknown. It can be shown that dividing by
n – 1 gives an unbiased estimate, meaning that on average our estimated variance equals
the true variance; that is, E(s²) = σ². Two important characteristics of the variance
estimate s² are its unbiasedness as an estimator of σ², and that using x̄ minimizes the
sum of squared deviations, which is the numerator of s². [Proof] An alternative
formula for s² is (∑x² – n·x̄²)/(n – 1). [Proof]
Also, it is interesting to understand the function of the median as a measure of
location and when it is more appropriate to use the median rather than the mean.
Let us look at the following statistic, called the mean absolute deviation. Suppose
we have a measure of centrality x* and we take the absolute differences of our
sample data from that point. The sum of these absolute differences is Σ|x – x*| and
the mean absolute difference, or MAD, is the sum above divided by n. The median
minimizes MAD. [Proof]
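That the median minimizes MAD can also be checked numerically; a sketch with a deliberately skewed, hypothetical sample:

```python
import statistics

x = [1, 2, 3, 4, 100]   # hypothetical sample with one large outlier
n = len(x)

def mad(center):
    """Mean absolute deviation of the sample from a candidate center x*."""
    return sum(abs(xi - center) for xi in x) / n

med = statistics.median(x)   # 3
mean = statistics.mean(x)    # 22

# The median minimizes MAD; in particular it beats the mean and other candidates.
assert mad(med) <= mad(mean)
assert all(mad(med) <= mad(c) for c in [0, 1, 2, 2.5, 3.5, 10, 50])
```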
IV. The mean and standard deviation of a generic probability distribution
The table below is an example of a probability distribution. The values of X are
the number of absences per semester of my BADM 710 students. Thus, 40% of my
students miss none of my classes, 30% miss exactly one class, etc.
X P(X)
0 .4
1 .3
2 .2
3 .1
6
Note that the P(X) are all non-negative (for all X’s that do not occur, we assume
P(X) = 0), and ∑P(X) = 1. Any distribution having these characteristics is said to
be a probability distribution or PD.
The mean of a PD is the weighted average of the X values, where the weights are
the respective probabilities. Thus, for the above distribution, μ = ∑X·P(X) = 0(.4)
+ 1(.3) + 2(.2) + 3(.1) = 1. [See the table below for calculating the mean and variance]
The variance of a distribution is the weighted average of the squared differences
from the mean; thus, σ² = ∑(X – μ)²·P(X) = (–1)²(.4) + 0²(.3) + 1²(.2) + 2²(.1)
= .4 + 0 + .2 + .4 = 1. So σ = 1.
X     P(X)   X·P(X)   (X–μ)²·P(X)         X²·P(X)
0     .4     0        (0–1)²(.4) = .4     0
1     .3     .3       (1–1)²(.3) = 0      .3
2     .2     .4       (2–1)²(.2) = .2     .8
3     .1     .3       (3–1)²(.1) = .4     .9
SUM   1      1        1                   2
Another formula for σ² is ∑X²·P(X) – μ² = 2 – 1² = 1. [Proof]
V. Two simple distributions
Uniform distribution. This distribution may be either discrete or continuous.
For both the discrete and continuous uniform distributions, μ = (a+b)/2, where a
is the smallest value and b is the largest; for any symmetric distribution the mean
and median are the same value.
[Figure: the discrete uniform distribution P(X) and the continuous uniform distribution f(X), each plotted for X = 1 to 10.]
For the discrete uniform distribution σ² = d²(n² – 1)/12, where d is the distance
between successive values and n is the number of x values; for the continuous
uniform distribution σ² = (b – a)²/12. [Proofs]
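A quick numerical check of the discrete uniform variance formula, using the values 1 through 10 (so d = 1 and n = 10):

```python
# Discrete uniform on 1..10.
values = list(range(1, 11))
n, d = len(values), 1
mu = sum(values) / n                         # (a + b)/2 = 5.5
var_direct = sum((v - mu) ** 2 for v in values) / n

var_formula = d ** 2 * (n ** 2 - 1) / 12     # = 99/12 = 8.25
assert abs(mu - (1 + 10) / 2) < 1e-12
assert abs(var_direct - var_formula) < 1e-9
```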
VI. A very important continuous distribution is the normal distribution, or bell
curve.
The probability density function (pdf) of the normal distribution is
f(x) = [1/(σ√(2π))]·e^(–.5[(x – μ)/σ]²).
While this looks ugly, it actually has nice properties. For example, whether x – μ is
positive or negative, when squared it has the same value; thus, the distribution is
symmetric around x = μ, where [(x – μ)/σ]² is minimized, but since we take e to the
negative power, f(x) is maximized at the mean.
Also, it can be shown that the function f(x) has a slope (derivative) at all real x,
so it cannot come to a point, but is smooth. Finally, since the value of f(x), which is a
power of the positive number e times a positive constant, must always be positive, f(x)
never crosses the x axis.
All these together imply a bell-shaped distribution. The positive constant term in
front, 1/(σ√(2π)), merely standardizes the area under the curve to be 1. The only
other issue is why the .5 in the exponent. Well, it just is. In future we will indicate
that a random variable X is normally distributed with mean μ and standard
deviation σ as follows: X ~ N(μ, σ).
To find probabilities we standardize the normal using z-scores, where
Z = (x – μ)/σ is the number of standard deviations x is from the mean.
IQ tests are normally distributed and have a mean of 100 points and a standard
deviation of 15 points. That is, the average child has a 100 IQ and on average a
child’s IQ differs from the mean by 15 points.
[Figure: the standard normal distribution, f(X) for X from –3 to 3.]
Z-scores and using the table: find P(IQ > 115). Since 115 differs from the mean
by one standard deviation (the z-score of 115 = (115 – 100)/15 = 1), we look on the
normal table for 1.00. We see the area is .3413, which means that 34.13% of the
population lies between Z = 0 and Z = 1. Since half the population lies to the right of
the mean (it’s symmetric), .5 – .3413 = .1587, or 15.87% of children have IQs above
115 (and, by symmetry, 15.87% below 85).
a. Find the following probabilities
P(IQ < 105) (.5 + .1293 = .6293)
P(IQ < 80) (.5 – .4082 = .0918)
P(110 < IQ < 130) (.4772 – .2486 = .2286)
P(80 < IQ < 115) (.4082 + .3413 = .7495)
P(80 < IQ < 90) (.4082 – .2486 = .1596)
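These table answers can be reproduced (to table-rounding accuracy) with the standard normal CDF, written here in Python via the error function; the exact values differ slightly from those above because the table rounds z to two decimals:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15   # IQ parameters from the example

def p_below(iq):
    return phi((iq - mu) / sigma)

# Compare with the table answers above, allowing for rounding of z.
assert abs(p_below(105) - .6293) < .005
assert abs(p_below(80) - .0918) < .005
assert abs((p_below(130) - p_below(110)) - .2286) < .005
assert abs((p_below(115) - p_below(80)) - .7495) < .005
assert abs((p_below(90) - p_below(80)) - .1596) < .005
```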
VII. Continuous distribution vs. discrete distribution mean and variance
Again, the mean of a discrete PD is μ = ∑X·P(X) and the variance
σ² = ∑(X – μ)²·P(X).
The equivalent for a continuous distribution is to substitute f(x)dx for P(X) in
both formulas and to use the integral instead of the summation; thus,
μ = ∫x·f(x)dx and σ² = ∫(x – μ)²·f(x)dx.
Another PD example:
a. Let P(X = k) = (1/2)^k, k = 1, 2, 3, …
b. ∑P(X) = 1 and all P(X) ≥ 0; thus P(X) is a probability distribution.
c. Find μ and σ. (Don’t try it by hand; it’s not for the faint of heart. Use an
Excel spreadsheet to estimate them to three decimal places.) We see that
μ = 2 and σ² = 6 – 2² = 2. For a formal proof, click HERE.
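In place of the Excel estimate suggested above, a Python sketch summing the series numerically (the tail is negligible well before k = 200):

```python
# P(X = k) = (1/2)^k for k = 1, 2, 3, ...
ks = range(1, 200)
total = sum(0.5 ** k for k in ks)
mu = sum(k * 0.5 ** k for k in ks)            # E[X]
ex2 = sum(k ** 2 * 0.5 ** k for k in ks)      # E[X^2]
var = ex2 - mu ** 2

assert abs(total - 1) < 1e-9   # a valid probability distribution
assert abs(mu - 2) < 1e-9      # mu = 2
assert abs(var - 2) < 1e-9     # sigma^2 = 6 - 2^2 = 2
```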
VIII. The Poisson Distribution
P(X = k) = e^(–λ)·λ^k/k!; μ = λ, σ² = λ (See Excel estimates for λ = 1 and λ = 2)
Example for λ = 3 using the Poisson table below
k P(X=k) X*P(X) (X-3)^2*P(X) CumP(X)
0 0.0498 0.0000 0.4481 0.0498
1 0.1494 0.1494 0.5974 0.1991
2 0.2240 0.4481 0.2240 0.4232
3 0.2240 0.6721 0.0000 0.6472
4 0.1680 0.6721 0.1680 0.8153
5 0.1008 0.5041 0.4033 0.9161
6 0.0504 0.3025 0.4537 0.9665
7 0.0216 0.1512 0.3457 0.9881
8 0.0081 0.0648 0.2025 0.9962
9 0.0027 0.0243 0.0972 0.9989
SUM 0.9989 2.9886 2.9400
Poisson distribution with λ = 3: as further terms (k = 10, 11, …) are included,
∑X·P(X) approaches μ = 3 and ∑(X – 3)²·P(X) approaches σ² = 3.
Uses of the Poisson distribution: Cars at a toll plaza; customers at a bank; phone
calls at a switchboard.
i) If three cars per minute arrive at a toll plaza, and arrivals are
Poisson-distributed, use the table above to find the probability
that during the next minute
no cars arrive (P(k = 0) = .0498)
exactly two arrive (P(k = 2) = .2240)
more than five arrive (1 – P(k ≤ 5)) = 1 - .9161 = .0839)
between two and four cars, inclusive, arrive
(P(k ≤ 4) – P(k ≤ 1) = .8153 – .1991 = .6162)
ii) Approximation of the Binomial distribution, which we discuss in
the next topic.
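The toll-plaza answers can be verified directly from the Poisson formula; a short Python sketch:

```python
import math

def pois_pmf(k, lam):
    """Poisson probability P(X = k) = e^(-lambda) * lambda^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pois_cdf(k, lam):
    """Cumulative probability P(X <= k)."""
    return sum(pois_pmf(i, lam) for i in range(k + 1))

lam = 3  # three cars per minute
assert abs(pois_pmf(0, lam) - .0498) < 5e-4                        # no cars
assert abs(pois_pmf(2, lam) - .2240) < 5e-4                        # exactly two
assert abs((1 - pois_cdf(5, lam)) - .0839) < 5e-4                  # more than five
assert abs((pois_cdf(4, lam) - pois_cdf(1, lam)) - .6162) < 5e-4   # two to four
```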
IX. The Binomial Distribution [A related topic for sampling from finite
populations is covered in the Hypergeometric Distribution]
The binomial distribution is used when only one of two possible outcomes will
occur on a trial of an experiment conducted from an infinite population; thus, two
values are possible. We assign the values of 1 for a “success” and 0 for a “failure”.
Thus:
“Success” with probability p; X = 1 and “Failure” with prob. 1 - p; X = 0.
Examples are tossing a fair coin, finding defects when sampling relatively few parts
in a factory, etc. The parameters are μ = p, the probability of a success, and
σ² = p(1 – p). (See the table below)
X     P(X)   X·P(X)   (X–p)²·P(X)
0     1–p    0        (–p)²(1–p) = p²(1–p)
1     p      p        (1–p)²·p
SUM   1      p        p(1–p)
For the binomial distribution for n trials, let Y = ∑X; thus, Y = the number of
successes in the n trials. Statistically, it can be shown that E(∑X) = ∑E(X) = np = μ.
And, if the trials are independent, Var(∑X) = ∑Var(X) = nσ² = np(1 – p). The
probability of k successes in n trials is given by P(Y = k) = C(n, k)·p^k·(1 – p)^(n–k),
where C(n, k) = n!/[k!(n – k)!] is the number of ways you can obtain k successes in n
independent trials; the rest of the expression is the probability of each way. These
calculations will be unpleasant to do for fairly large n and k, so tables for particular
values are available in most textbooks.
Example: for n = 20, p = .6, see the graph and cumulative table below for the
probabilities for this binomial distribution. Use the table to find the following
probabilities for the number of successes in 20 trials:
i) exactly ten (Hint: P(X ≤ 10) – P(X ≤ 9)) (.1171)
ii) more than thirteen (Hint: 1 – P(X ≤ 13)) (.2500)
iii) between ten and fifteen, inclusive (.8215)
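The same three answers follow from the binomial formula; in this Python sketch, math.comb supplies C(n, k):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability P(Y = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """Cumulative probability P(Y <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 20, .6
assert abs(binom_pmf(10, n, p) - .1171) < 5e-4                        # exactly ten
assert abs((1 - binom_cdf(13, n, p)) - .2500) < 5e-4                  # more than 13
assert abs((binom_cdf(15, n, p) - binom_cdf(9, n, p)) - .8215) < 5e-4 # ten to fifteen
```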
[Figure: the binomial distribution, P(X = k) for k = 0 to 20, with p = .6, n = 20.]
Binomial Distribution; p = .6, n = 20
k    P(X = k)    P(X ≤ k)
0 0.0000 0.0000
1 0.0000 0.0000
2 0.0000 0.0000
3 0.0000 0.0000
4 0.0003 0.0003
5 0.0013 0.0016
6 0.0049 0.0065
7 0.0146 0.0210
8 0.0355 0.0565
9 0.0710 0.1275
10 0.1171 0.2447
11 0.1597 0.4044
12 0.1797 0.5841
13 0.1659 0.7500
14 0.1244 0.8744
15 0.0746 0.9490
16 0.0350 0.9840
17 0.0123 0.9964
18 0.0031 0.9995
19 0.0005 1.0000
20 0.0000 1.0000
From the discussion above, we see that we can now use four different methods to
calculate binomial probabilities: the binomial formula; the binomial tables (or Excel
for n or p not listed in the tables); the Poisson (for large n and small p); and the
normal (whenever np and n(1-p) are both ≥ 5).
We have developed an Excel spreadsheet that will allow the user to determine
which method to use for a given set of n and p. The sheet is found in the Statistical
Template; go to the sheet titled “Three Distributions”. Below is the plot
for n = 1000 and p = .01. We see that the preferred approximation technique is the
Poisson, whose probability values track the binomial probabilities almost perfectly
for all values of X.
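The comparison on the “Three Distributions” sheet can be reproduced in a short Python sketch; here the normal column is built with the continuity correction (an assumption about how the sheet computes it):

```python
import math

n, p = 1000, 0.01
lam = n * p                             # 10
sigma = math.sqrt(n * p * (1 - p))      # sd of the binomial

def binom(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson(k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal(k):
    # Continuity-corrected normal approximation of P(X = k).
    return phi((k + .5 - lam) / sigma) - phi((k - .5 - lam) / sigma)

# Total absolute error of each approximation over k = 0..19.
ks = range(20)
err_pois = sum(abs(binom(k) - poisson(k)) for k in ks)
err_norm = sum(abs(binom(k) - normal(k)) for k in ks)
assert err_pois < err_norm   # the Poisson tracks the binomial far more closely
```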
X. One more distribution: the (negative) exponential distribution
This distribution is related to the Poisson; it describes the time between successes
rather than the number of successes. It is the continuous counterpart of the Poisson,
which counts occurrences per period of time.
As time is continuous, the exponential distribution is continuous: f(t) = λe^(–λt),
t ≥ 0, where λ > 0; λ is the same parameter as for the Poisson, that is, the mean
number of successes per time period. The mean is 1/λ, and P(Time > t) = e^(–λt).
Example: on average customers arrive at a bank every two minutes, that is, the
mean number of arrivals per ten minute period is five.
[The table and plot below belong to the “Three Distributions” comparison discussed above: the binomial with p = 1.00E-02, n = 1000 (np = 10), and the approximating Poisson and Normal distributions.]
k    Binomial       Poisson     Normal
-2   0              0           9.31E-05
-1   0              0           0.000295
0    4.31712E-05    4.54E-05    0.000844
1    0.000436073    0.000454    0.002185
2    0.002200188    0.00227     0.005119
3    0.007393223    0.007567    0.010851
4    0.018613745    0.018917    0.020809
5    0.037453112    0.037833    0.0361
6    0.062737115    0.063055    0.056658
7    0.089986568    0.090079    0.080448
8    0.112824069    0.112599    0.10334
9    0.125613329    0.12511     0.120093
10   0.125740211    0.12511     0.126261
11   0.114309283    0.113736    0.120093
12   0.095161516    0.09478     0.10334
13   0.073053285    0.072908    0.080448
14   0.052022794    0.052077    0.056658
15   0.034541734    0.034718    0.0361
16   0.02147955     0.021699    0.020809
17   0.012558454    0.012764    0.010851
18   0.006927587    0.007091    0.005119
19   0.003616635    0.003732    0.002185
[Figure: plot of the three distributions for k = –2 to 18.]
From the table below find the following probabilities for the time between the
next two arrivals
i) more than half a minute (Hint: half a minute = .5/10 = .05 ten-minute
periods, and λ = 5 per 10-minute period) (.7788)
ii) less than one minute (.3935)
iii) between one and two minutes (.6065 – .3679 = .2386)
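The three exponential answers can be checked directly from P(T > t) = e^(–λt), with time measured in 10-minute units:

```python
import math

lam = 5  # mean arrivals per 10-minute period, so t is in 10-minute units

def p_greater(t):
    """P(T > t) = e^(-lambda * t) for the exponential distribution."""
    return math.exp(-lam * t)

assert abs(p_greater(0.05) - .7788) < 5e-4                       # more than half a minute
assert abs((1 - p_greater(0.10)) - .3935) < 5e-4                 # less than one minute
assert abs((p_greater(0.10) - p_greater(0.20)) - .2386) < 5e-4   # one to two minutes
```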
[Figure: the exponential density f(t), for t from 0 to 1.1, with λ = 5.]
Exponential Distribution; λ = 5
t      f(t)    dt·f(t)   P(T > t) = e^(–λt)
0.00 5.000 0.2224 1.0000
0.05 3.894 0.1732 0.7788
0.10 3.033 0.1349 0.6065
0.15 2.362 0.1050 0.4724
0.20 1.839 0.0818 0.3679
0.25 1.433 0.0637 0.2865
0.30 1.116 0.0496 0.2231
0.35 0.869 0.0386 0.1738
0.40 0.677 0.0301 0.1353
0.45 0.527 0.0234 0.1054
0.50 0.410 0.0183 0.0821
0.55 0.320 0.0142 0.0639
0.60 0.249 0.0111 0.0498
0.65 0.194 0.0086 0.0388
0.70 0.151 0.0067 0.0302
0.75 0.118 0.0052 0.0235
0.80 0.092 0.0041 0.0183
0.85 0.071 0.0032 0.0143
0.90 0.056 0.0025 0.0111
0.95 0.043 0.0019 0.0087
1.00 0.034 0.0015 0.0067
1.05 0.026 0.0012 0.0052
1.10 0.020 0.0009 0.0041
1.15 0.016 0.0004 0.0032
Total P 1.0024
Statistics Lecture Notes – Part 2 Inferential Statistics of one variable
In this section of the Notes we will address the most important use of statistics:
making inferences about the populations being investigated. We begin by discussing
the central limit theorem and its applications in statistical inference.
I. The Central Limit Theorem (CLT)
This is without doubt the most important theorem in statistics – I like to refer to it
as the fundamental theorem of statistics – and almost all of what we do in statistics
is based on the CLT.
The central limit theorem states that, for large sample size n, the distribution of
the sample mean for an independent random sample of any distribution is
approximately normal with mean μ and standard deviation σ/√n. Practically, for
most distributions, large means n ≥ 30. [See CLT]
Thus, for large n, the statistic ∑x/n, and therefore ∑x, is normal. This implies that
for large n the distribution of the number of successes divided by the number of
trials, which is the sample proportion of successes p̂, for a binomial experiment is
normal with mean μ = p and standard deviation σ/√n = √(p(1 – p))/√n =
√(p(1 – p)/n). Large means that np ≥ 5 and n(1 – p) ≥ 5.
∑x is also normal; its mean is just n times p, or np, and its standard deviation is
n times that of p̂, or √(np(1 – p)).
Applications: Toss a fair coin 100 times. np = n(1 – p) = 50; μ = np = 50;
σ = √(np(1 – p)) = √(100·.5·.5) = 5. From the CLT the observed number of successes
is approximately normal with mean = 50 and σ = 5. Use the normal approximation to find the
following:
i) P(X = 50) [Hint: 50 lies between 49.5 and 50.5]
ii) P(X > 55) [Again use the continuity adjustment of .5 units]
iii) P(45 ≤ X ≤ 58)
Now p = .5, so p̂ is approximately normal with mean μ = .5 and σ = .05; therefore find the
probability the sample proportion is
iv) more than 60 percent
v) less than 45 percent
vi) between 45 and 50 percent
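The notes do not state the numerical answers for i)–vi); the sketch below computes them, applying the continuity adjustment where the hints call for it:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 100, 0.5
mu = n * p                            # 50
sigma = math.sqrt(n * p * (1 - p))    # 5

# i) P(X = 50) with the continuity adjustment: P(49.5 < X < 50.5)
p_i = phi((50.5 - mu) / sigma) - phi((49.5 - mu) / sigma)
# ii) P(X > 55) = P(X >= 56) -> P(X > 55.5)
p_ii = 1 - phi((55.5 - mu) / sigma)
# iii) P(45 <= X <= 58) -> P(44.5 < X < 58.5)
p_iii = phi((58.5 - mu) / sigma) - phi((44.5 - mu) / sigma)

# iv)-vi) for the sample proportion, p-hat ~ N(.5, .05)
p_iv = 1 - phi((.60 - .5) / .05)             # more than 60 percent
p_v = phi((.45 - .5) / .05)                  # less than 45 percent
p_vi = phi((.50 - .5) / .05) - phi((.45 - .5) / .05)   # between 45 and 50 percent
```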
II. For a binomial random variable with large n and small p, and np < 5.
A better approximation is often obtained using the Poisson with λ=µ=np. [Proof]
Example: Each day an inspector selects 200 parts randomly. If one percent of the
parts are expected to be defective, find the probabilities:
i) none are defective [λ= 2, P(x = 0) = .135]
ii) exactly two are defective [Hint: P(x ≤ 2) – P(x ≤ 1)]
iii) more than six are defective [1 – P(x ≤ 6)]
iv) between two and five, inclusive, are defective [You do it]
III. Applying the CLT to a large random sample
Each month the Bureau of the Census conducts a random sample of 50,000 U.S.
households for the Labor Department. Let’s suppose the survey includes two
questions:
b. “What do you estimate your household’s income was last month?”
c. “How many members of your household were unemployed last month?”
[In fact, the questions are a bit more sophisticated than these, but let’s suppose
this is the tenor of the questions]. Now suppose the survey found the following:
The mean for salaries was $4100, with a standard deviation s of $1050, and
of the 55,600 household members determined to be in the labor force, 5,060
were unemployed.
From sampling theory, we know that x̄ for the salary data is normally distributed
with mean μ and standard deviation σ/√n. While we don’t know the true σ,
due to the size of the sample, s ≈ σ and σ/√n ≈ 1050/√50,000 = $4.70. That is, the
mean of the population, whatever it is, differs from the $4100 we got from the
survey by about $4.70 on average. That means our survey value is really quite accurate.
We are now ready to construct what we call a confidence interval for μ. We know
that the true average monthly salary is not the sample mean of 4100. But we also
know it is quite close. Now from the normal distribution, we can find what fraction
of the population lies within a given number of standard deviations from the mean.
In fact, we see from the Z table that .4750 of the population lies within 1.96 standard
deviations on each side of the mean.
Thus, an interval around the mean that extends 1.96σ/√n on each side will contain 95%
of all sample means drawn from samples with n = 50,000. But to say this means that
there’s less than a 5% chance that the true mean lies beyond 1.96($4.70) = $9.21 from the
observed mean of this particular sample. Constructing this interval yields the
following result: we are 95% confident that the true mean of the population,
whatever it is, lies inside 4100 ± 9.21 = (4090.79, 4109.21). Thus, the formula for a
95% confidence interval is x̄ ± 1.96σ/√n. In general, a confidence interval
when σ is known is given by x̄ ± z_(α/2)·(σ/√n), where α is the area in the two tails and
1 – α is the area under the curve within the confidence interval.
Similarly, for the true unemployment rate of the population we use sampling
theory to tell us that the standard deviation of the sample proportion – the expected
difference between the true population proportion and the proportion obtained
from a sample of size n – is √(p(1 – p)/n). Of course we don’t know the actual
proportion p of the population that is unemployed. But again, since the sample size
is very large, 55,600, we can assume that the observed sample proportion is quite
close to the true proportion.
From our sample, p̂ = 5060/55,600 = .091. So √(p̂(1 – p̂)/n) =
√(.091(.909)/55,600) = .0012, or a little over one-tenth of one percent. Again, to
construct a 95% confidence interval for p, we get .091 ± 1.96(.0012) = .091 ± .0024 =
(.0886, .0934). In fact, this is quite a wide interval for a statistic as important as this
month’s unemployment rate. That explains why one needs to be very careful in
interpreting any one month’s estimate as being important. It’s only by watching
this statistic over time that we can be sure what is happening in the labor market.
On average this statistic is off by about .12 percentage points. Often the researcher
will say something like “last month’s unemployment was 9.1% ± .24%”. Generally,
surveys give the error associated with a 95% confidence interval. They will simplify
this by using 2σ rather than 1.96σ.
Example: In a hotly contested election, candidate A’s statistician surveys 1000
potential voters and finds that 512 of the respondents say that if the election were
held today, they would vote for A’s opponent. The remaining 488 voters say they
are going with A. Construct a 95% confidence interval for A.
We used the 95% confidence interval. But we could just as easily have asked for
a 90% or a 99% confidence interval. For the 95% interval the z-score used was
1.96; find the appropriate z-score for the 90% and the 99% intervals. [1.645 and
2.576]
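A sketch that constructs candidate A's 95% interval and recovers the z-scores by inverting the normal CDF numerically (bisection is used only because the standard library has no built-in inverse CDF):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_for_confidence(conf, lo=0.0, hi=10.0):
    """Find z with P(-z < Z < z) = conf by bisection on the CDF."""
    target = 0.5 + conf / 2   # upper cutoff as a CDF value
    for _ in range(80):
        mid = (lo + hi) / 2
        if phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Candidate A's 95% confidence interval: p-hat = 488/1000.
phat, n = 0.488, 1000
se = math.sqrt(phat * (1 - phat) / n)
z95 = z_for_confidence(0.95)
interval = (phat - z95 * se, phat + z95 * se)

assert abs(z95 - 1.96) < 0.005
assert abs(z_for_confidence(0.90) - 1.645) < 0.005
assert abs(z_for_confidence(0.99) - 2.576) < 0.005
```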
IV. Small sample confidence interval for the population mean
In the previous examples we constructed our confidence interval based on two
facts about large samples: by the CLT, is normal and s is a close approximation
of σ. But often we do not have the luxury of a large sample. So what do we do in
those instances in which we have only a small sample to go on?
Well, we saw from the CLT graphs that in the case of one particular distribution,
the normal, the sample mean for any size n is normally distributed. So, if the parent
distribution from which our sample is drawn is normal, or very close to normal, we
can assume that even for very small samples the sample mean is distributed
normally.
The second issue about how close the sample standard deviation is to the
population standard deviation still remains. Fortunately there is a way to adjust the
Z to produce confidence intervals. The individual who first determined the
corrected intervals was William Gosset, who worked for Guinness Breweries in
Ireland and published under the pseudonym Student.
his name is called the Student t distribution.
Now, anytime there is additional error created, if that error is balanced between
being too large and being too small, the implication is that the interval will have to
be wider. And since E(s²) = σ², the error is balanced. Therefore, the t exceeds the
corresponding z-score.
Now the reason we need to adjust the width of the interval is that s and σ differ.
The magnitude of that difference is based on the size of the sample; for very large
samples the error is very small and we can use the Z. That is, for large n, t ≈ Z. But
for small n the difference between s and σ will be quite large on average and t will
be considerably larger than Z. The measure that determines the amount of
adjustment needed is referred to as the degrees of freedom. In the case in which we
estimate the sample variance using the sample mean, we divide the sum of squares
by n – 1, so n – 1 is the degrees of freedom and we use t_(α/2) with n – 1 degrees of
freedom. In this case α is the area in the two tails and 1 – α is the area between the tails.
To understand how to construct a small-sample interval, let’s do an example.
Suppose we obtained a random sample of 10 IQ scores from children in a particular
town and the mean was 108 and the standard deviation s was 16.5. Now we know
that IQ scores are normally-distributed nationally and therefore will be very close to
normal within the town. Then the 90% confidence interval will be
x̄ ± t(s/√n) = 108 ± t(16.5/√10).
To find the t, go to the t-table, .05 column, and ninth row. [See the t table] We see
that t = 1.833, so the interval is 108 ± 1.833·(16.5/√10) = 108 ± 1.833·5.22 = 108 ±
9.56 = (98.44, 117.56). Thus, we are 90% confident that the mean IQ of children in
this town lies within this interval.
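The interval arithmetic can be checked in a few lines, taking t = 1.833 from the table as given:

```python
import math

# Sample summary from the example above.
n, xbar, s = 10, 108, 16.5
t_05_9 = 1.833   # t table: .05 column, 9 degrees of freedom

half_width = t_05_9 * s / math.sqrt(n)
interval = (xbar - half_width, xbar + half_width)

assert abs(half_width - 9.56) < 0.01
assert abs(interval[0] - 98.44) < 0.01 and abs(interval[1] - 117.56) < 0.01
```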
V. The hypothesis test of the population mean and proportion.
Another way to address the issue posed in the analysis above is to ask whether or
not some a priori assumption about the population’s mean or proportion is
supported by the data gathered. In the example just above, we might ask whether
or not the mean IQ for children in the town in which the data were collected exceeds
the national average of 100. The observed mean IQ suggests this might be the case;
however, it may just be that the sampling error was large.
We would then set up the following hypothesis test:
H0: μ ≤ 100
Ha: μ > 100
α = .05
This particular test is a one-tailed, greater than test. H0 is called the null
hypothesis, and is what is not covered by the alternative hypothesis Ha. Ha states the
hypothesis we are trying to prove and does not contain equality in it; the three
possibilities for Ha are μ > μ0, μ < μ0 and μ ≠ μ0. H0 always has equality: the
corresponding H0 are μ ≤ μ0, μ ≥ μ0 and μ = μ0.
No matter how large a sample we obtain, we can never be 100% sure, if we reject
the null hypothesis in favor of the alternative, that we are correct. The α, or
significance level, is the probability that we incorrectly reject the null hypothesis. In
other words, we can tolerate falsely rejecting a true null hypothesis α of the time. In
the IQ example above we are trying to prove that the average IQ in our town is
above that of the population as a whole.
Suppose we are willing to accept a false rejection level of 5%. So only if our
observed sample mean is too far above the hypothesized mean of 100 can we be 95%
confident that we are not falsely rejecting it.
Now the z-score that we would need, Z = (x̄ – μ0)/(σ/√n), must be large enough
that the probability of exceeding it by chance is less than 5%. We call this z-score
the critical z; in this case, had the sample been very large, Zc = 1.645 (see the Z
table for .4500). But in our example we don’t know σ, so we revise the
formula to t = (x̄ – μ0)/(s/√n), and tc = t.05, 9 = 1.833. If the observed t exceeds
1.833, we reject H0 and conclude the town’s mean IQ exceeds 100. We get (108 –
100)/(16.5/√10) = 1.533 < 1.833, so we fail to reject H0. That is not to say we accept
H0; our best guess still is that the mean exceeds 100. After all, 108 is considerably
larger than 100, but the sample was just too small to reach that conclusion at the
5% significance level.
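The same one-sample t computation can be sketched in a few lines (Python assumed; the critical t is again read from the table):

```python
import math

n, xbar, s, mu0 = 10, 108, 16.5, 100
t_obs = (xbar - mu0) / (s / math.sqrt(n))  # (108 - 100)/5.22 ~ 1.533
t_crit = 1.833                             # t(.05, 9 df), from the t table

# One-tailed, greater-than test: reject H0 only if t_obs > t_crit
print(round(t_obs, 3), t_obs > t_crit)     # 1.533 False -> fail to reject H0
```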
Let’s now consider the election example where candidate A has only p̂ = 488/1000
= .488 of the sample in her favor. We are trying to decide whether or not to throw in
the towel. Suppose we conclude that if she has less than a 5% chance of winning the
election, we will stop throwing good money after bad. Then our test will be
conducted as follows:

H0: p ≥ .50
Ha: p < .50
α = .05

If we reject H0, we quit. Now zobs = (p̂ – p0)/√(p0(1 – p0)/n) = (.488 –
.5)/√(.5 × .5/1000) = –.012/0.01581 = –.76. To reject H0 we needed
zobs < –1.645, so A should carry on the fight. Are you getting the hang of it yet?
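A quick check of the z computation (a sketch; Python assumed):

```python
import math

n, x = 1000, 488
p_hat, p0 = x / n, 0.50
se = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
z_obs = (p_hat - p0) / se               # -.012/.01581 ~ -.76
z_crit = -1.645                         # one-tailed, less-than, alpha = .05

print(round(z_obs, 2), z_obs < z_crit)  # -0.76 False -> fail to reject; keep campaigning
```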
VI. Two population tests of hypotheses.
So far we have limited our analyses to one population, one variable. In this
section we discuss comparisons of means and proportions on one variable across
two populations. We begin with a general discussion of distribution theory.
Suppose two random variables are normally distributed, Xi ~ N(μi, σi), i = 1, 2,
and we draw independent samples of sizes n1 and n2. Then
x̄1 – x̄2 ~ N(μ1 – μ2, √(σ1²/n1 + σ2²/n2)) and

[(x̄1 – x̄2) – (μ1 – μ2)]/√(σ1²/n1 + σ2²/n2) ~ N(0, 1) is a Z score.
From the sampling distribution of the sample proportion, it is left as an exercise
to show p̂1 – p̂2 ~ N(p1 – p2, √(p1(1 – p1)/n1 + p2(1 – p2)/n2)). Assuming we are
testing for p1 – p2 = 0,

Zobs = (p̂1 – p̂2)/√(p̄(1 – p̄)(1/n1 + 1/n2)), where p̄ = (n1p̂1 + n2p̂2)/(n1 + n2).

Because we don’t know what the values are for p1 and p2 and we are testing for the
equality of the two proportions, we use the best guess we have for the common p;
that is the p̄ given above.

[As an exercise, show that p̄ is both a weighted average of the two observed p̂’s,
where the weights are proportional to the sample sizes and sum to one, and is the
same as pooling the observed number of successes from the two samples divided by
the total number of observations in the samples.]
- Two-tailed test of differences in proportions:
H0: p1 = p2
Ha: p1 ≠ p2
α = .05
Example: Test for a difference in proportions of two stores selling the store brand of
a product if we obtain the following data in random sample of the two stores.
Number of
total units sold
Number of store
brand units sold
Store 1 1000 350
Store 2 1500 600
From these data, p̂1 = 350/1000 = .35, p̂2 = 600/1500 = .40, and p̄ = (350 + 600)/(1000 + 1500)
= 950/2500 = .38. Now rather than calculate the observed Z by hand, we use the Statistics
Template to do the analysis. Go to the Diff proportion sheet and input the values.
You should get the following output.

Large-Sample Test for difference between 2 Population Proportions
                            1        2
Sample size n =          1000     1500
No. Successes (x) =       350      600
p-hats =                 0.35     0.40
p-hat (pooled) = (x1 + x2)/(n1 + n2) = 950/2500 = 0.38
Z TS = (p1 - p2)/√[p(1-p)(1/n1 + 1/n2)] = -0.05/0.019816 = -2.523
alpha = 5.00%          alpha/2 = 2.50%
p-value (1 tail) = 0.58%     p-value (2 tail) = 1.16%
z-critical (1-tail) = 1.645  z-critical (2-tail) = 1.960

Since |Zobs| = 2.523 exceeds Zcrit = 1.96, we reject H0 and conclude the proportions
differ at the two stores.

- One-tailed test of differences of means, large samples:

Suppose we wish to test, at the 10% significance level, whether mean daily receipts
at Store 2 exceed those at Store 1, given the following random samples:

         Number of days   Average receipts   Standard deviation
Store 1        49              $4200               $2100
Store 2        36              $5000               $2400
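The template’s numbers for the two-store proportion test can be verified directly (a sketch in Python rather than the template):

```python
import math

n1, x1 = 1000, 350
n2, x2 = 1500, 600
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                         # 950/2500 = 0.38
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))  # ~ 0.019816
z_obs = (p1 - p2) / se

print(round(z_obs, 3), abs(z_obs) > 1.96)              # -2.523 True -> reject H0
```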
H0: μ2 ≤ μ1
Ha: μ2 > μ1
α = .10
Now since for large samples (n > 30), s ≈ σ, Zobs = (4200 – 5000)/√(s1²/n1 + s2²/n2)
= –800/√(2100²/49 + 2400²/36) = –800/500 = –1.60. Rather
than do the calculations, again we may use the template.
Two sample z-test Template
n1 = 49 n2 = 36
x-bar1 = 4200 x-bar2 = 5000
s1 = 2100 s2 = 2400
tobs = -1.600
p-value(1) = 0.055
p-value(2) = 0.110
alpha = 10.00%
t-crit (1-tail) = 1.282
t-crit (2-tail) = 1.645
Since the observed z-score of -1.60 is less than the critical z of -1.282 (larger in absolute
value and negative, as it must be), we reject H0 and conclude, at the 10% significance level,
that μ2 > μ1.
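A direct check of the z-statistic (a sketch; Python assumed):

```python
import math

n1, xbar1, s1 = 49, 4200, 2100
n2, xbar2, s2 = 36, 5000, 2400

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # sqrt(90000 + 160000) = 500
z_obs = (xbar1 - xbar2) / se             # -800/500 = -1.6
z_crit = -1.282                          # one-tailed, alpha = .10

print(z_obs, z_obs < z_crit)             # -1.6 True -> reject H0; mu2 > mu1
```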
- Small sample differences in mean.
In comparing means taken from small samples, we again encounter the same problems
we had with the test of the mean: the issue of normality and not knowing the true
variances. The first is easily dealt with if we can assume both populations are
approximately normally distributed. If we were comparing exam scores on standardized
tests, for example, this assumption makes sense.
The second issue is more nettlesome. To see why, look at the test statistic from the large-
samples case, Zobs = (x̄1 – x̄2)/√(s1²/n1 + s2²/n2). In that case we could rely on the large sample
sizes to claim that the sample variances were approximately the same as the population
variances and therefore the Z is distributed approximately normal.
But with small samples the distribution of the test statistic is t-distributed. But with the t
we need to know the degrees of freedom. In this case, however, we have two sources of
error, from each of the two samples, with (n1 – 1) and (n2 – 1) degrees of freedom,
respectively. While both create greater uncertainty, the smaller sample of the two is of
greater concern since that requires greater adjustment from the Z. There are two ways to
deal with the problem; but in the case that we feel the two population variances are about
the same, as would be the case in comparing standardized test scores, we make that
assumption (the test for equality of variances will be shown in the next section) and
estimate the common variance as follows:
sp² = [(n1–1)s1² + (n2–1)s2²]/[(n1–1) + (n2–1)] = [(n1–1)s1² + (n2–1)s2²]/(n1 + n2 – 2)

We then use this value in place of the two variances in the formula; this statistic is
distributed as a t with n1 + n2 – 2 degrees of freedom. To see why we use this formula to
estimate the common variance, recall that s² = Σ(x – x̄)²/(n–1) for each sample, or
(n–1)s² = Σ(x – x̄)².
Thus, sp2 is a weighted average of the two variances with weights proportional to the
degrees of freedom and summing to one. But it is also the same as throwing all the sum of
squares into one basket and dividing by the sum of the two degrees of freedom. It makes
perfectly good sense to do it this way as we assumed each of the squared differences from
the respective population means is the same size. We lose two degrees of freedom from n1 +
n2 because we allowed each of the samples to select its own mean.
Suppose we wanted to test for the equality of two population means drawn from two
normally-distributed populations with equal variance. First we assume both populations
are normally distributed with equal variances. The sample data are given below.

          n     x̄     s
Pop. 1   16   125    22
Pop. 2   22   108    18

Using the template (2Sample t-test) we obtain the following results for a two-tailed, 10%
significance level test.

Two sample t-test Template for BUSN 5760
n1 = 16            n2 = 22
x-bar1 = 125       x-bar2 = 108
s1 = 22            s2 = 18
sp = 19.76529
tobs = 2.617733
p-value(1) = 0.006432
p-value(2) = 0.012865
alpha = 10.00%
t-crit (1-tail) = 1.305514
t-crit (2-tail) = 1.688297

Since the tobs of 2.62 exceeds the critical t of 1.69, we can reject H0 and conclude the
population means are not equal. Note that the degrees of freedom of this test is
n1 + n2 – 2 = 36, large enough to use the Z test; the z critical would have been 1.645,
only slightly less than the correct t of 1.688.
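The pooled-variance computation can be reproduced as follows (a sketch in Python):

```python
import math

n1, xbar1, s1 = 16, 125, 22
n2, xbar2, s2 = 22, 108, 18

# Pooled variance: weighted average of the two sample variances,
# with weights proportional to the degrees of freedom
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)
t_obs = (xbar1 - xbar2) / (sp * math.sqrt(1/n1 + 1/n2))
df = n1 + n2 - 2  # 36

print(round(sp, 3), round(t_obs, 3))  # 19.765 2.618
```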
In the next section, we discuss a test of equal variances. In conducting that test, we may
find that the assumption of equal variances is rejected, in which case an alternative
method must be used. For a discussion of what we may do in this case, read how we can
estimate the appropriate degrees of freedom for the t test without assuming equal
variances here. The small sample difference of means sheet in the Statistical Template
conducts the test of variances automatically. These topics, the F-test of equal variances
and calculating the approximate degrees of freedom, are not covered in the lecture videos.
VII. Test of equal variances
This test relies on the fact that if two populations have equal variances then the ratio of
the sample variances should be one. But since the observed variance for each sample is
not equal to the corresponding population variance, the observed ratio will not equal 1.
Now, take the ratio of the larger of the two to the smaller variance and compare the
quotient to the test statistic, the F statistic with nℓ-1, ns–1 degrees of freedom, where nℓ
represents the number of observations in the sample with the larger variance and ns is the
number of observations in the other sample.
The test is conducted as follows:
H0: σ1 = σ2
Ha: σ1 ≠ σ2
α = α0
In this case we want to fail to reject the null hypothesis in order to use the assumption
that the variances are the same. The assumption is that both populations are normally-
distributed, the same assumption we made to compare the means.
The F distribution has the following appearance:

[Figure: the F(21, 15) density, with the variance ratio on the horizontal axis and the
rejection area α in the right tail.]

The area to the right of the ratio of the two variances is the probability that you obtain a
given ratio if the population variances are actually equal. In our case, the ratio is (22/18)² =
1.493827; the probability of getting this large an observed F-statistic can be found in Excel
as follows: “=fdist((22/18)^2,21,15)”. The answer is 0.214944. The critical F value is found
in the F-table with the appropriate significance level. Since we used .10 for the test of the
means, we can use the same α for the F-test. The F-Tables hyperlinks are found at
http://faculty.citadel.edu/silver/f_tables.htm; now click on α = .1. While we cannot find 21
df in the numerator, we do see 20 and 25. The critical F value for (20, 15) is 1.92 and for
(25, 15) it is 1.89. Our observed F is well below either of these two values, so we fail to
reject the null hypothesis and can conclude the variances are about equal, which is what we
wanted to show.
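The variance-ratio check above can be sketched as follows (Python assumed; the table critical values are those quoted in the text):

```python
# F ratio for the equal-variance check in the pooled-t example above
s1, s2 = 22, 18                 # sample standard deviations
F_obs = (s1 / s2) ** 2          # larger variance over smaller, ~ 1.494

# Critical values read from the alpha = .10 F table, as in the text
F_crit_20_15, F_crit_25_15 = 1.92, 1.89

# Observed F is below both bracketing critical values -> fail to reject H0
print(round(F_obs, 4), F_obs < F_crit_25_15)  # 1.4938 True
```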
VIII. More than two populations comparisons of means and proportions; ANOVA and χ2
tests.
The analysis of variance technique, ANOVA, is used when we wish to make inferences
about the equality of means of three or more populations. As was the case for the two
population case, we assume all populations are normally-distributed and the variances are
equal. The hypotheses are:
H0: μi = μj, all i, j
Ha: μi ≠ μj , some i, j
α = α0
That is, either all the means are the same, or at least two means differ.
[Figure: two panels, each showing three normal curves for Pops 1, 2, and 3: one labeled
“Means approximately equal,” the other “Means clearly differ.”]

The two graphs above show what the distributions will look like under the two
hypotheses: the one on the left is what we would get if the means were approximately
equal; the one on the right represents the distribution of observed values if there are
significant differences in the population means.
If the first case exists, then the sums of squared differences calculated from the means of
the respective samples, the i, i = 1, 2, 3 in this case, will be insignificantly different, on
average, than if calculated from the overall sample mean T. On the other hand, if the
means of the three populations are not all the same, and at least one of them differs
significantly from at least one of the others, then calculating the sum of squares from the
overall mean rather than from the group means will result in a significantly larger sum. In
this latter case we can partition the errors into two parts, what we call the within groups
sum of squares, SSwg, and the between groups sum of squares, SSbg, and conduct the F-test
on the ratio of these two sums of squares after each sum of squares is divided by its degrees
of freedom. Thus, the F is the ratio of two variances as we did earlier.
To see how this is done, suppose we start by summing up the total sum of squares, SST =
Σ(x – x̄T)² over all i = 1, 2, … , N, where N = n1 + n2 + n3 is the total number of
observations in our sample. Alternatively, we could calculate this total sum group by
group: for the first group, write each deviation as (x – x̄1) + (x̄1 – x̄T), so the group’s
contribution is Σ[(x – x̄1) + (x̄1 – x̄T)]². Expanding, this sum equals

Σ(x – x̄1)² + 2(x̄1 – x̄T)Σ(x – x̄1) + n1(x̄1 – x̄T)².

The middle term, 2(x̄1 – x̄T)Σ(x – x̄1), equals zero, as Σ(x – x̄) ≡ 0 for the values of x
that produced the x̄.

The first term, Σ(x – x̄1)², is our within-group sum of squares for group 1, and the last
term, n1(x̄1 – x̄T)², is the between-group sum of squares from group 1. We do the
same for the other two groups, add up all of the within-group sums of squares and
all the between-group sums of squares, and take the ratio of the variances
(SSbg/dfbg)/(SSwg/dfwg) = MSbg/MSwg. The observed statistic is an F-statistic with the
degrees of freedom calculated as follows. [MS stands for mean square.]
The degrees of freedom or df for the sum of squares within group i is ni – 1, the df of the
SSwg = n1 – 1 + n2 – 1 + n3 – 1 = N – 3. [Were there ng groups, this sum would be N - ng.]
The df of the SST = N – 1, therefore the SSbg df is (N – 1) – (N – ng) = ng – 1 = 2 in this
example. So the df of the F-test are (ng - 1, N – ng). As an example of how this is done,
look at the following data of observations from three groups.
Group 1: 132, 132, 131, 138, 137, 132, 132
Group 2: 122, 123, 120, 120, 124, 128
Group 3: 126, 133, 128, 129, 127
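The ANOVA sums of squares can be reproduced from the raw data (a sketch in Python rather than the template):

```python
from statistics import mean

groups = [
    [132, 132, 131, 138, 137, 132, 132],
    [122, 123, 120, 120, 124, 128],
    [126, 133, 128, 129, 127],
]
N = sum(len(g) for g in groups)          # 18 observations in total
grand = sum(sum(g) for g in groups) / N  # overall mean x-barT ~ 128.556

# Within-groups SS: squared deviations from each group's own mean
ss_wg = sum(sum((x - mean(g))**2 for x in g) for g in groups)
# Between-groups SS: group sizes times squared deviations of group means
ss_bg = sum(len(g) * (mean(g) - grand)**2 for g in groups)

ms_wg = ss_wg / (N - len(groups))  # df = N - ng = 15
ms_bg = ss_bg / (len(groups) - 1)  # df = ng - 1 = 2
F_obs = ms_bg / ms_wg

print(round(ss_wg, 2), round(ss_bg, 2), round(F_obs, 2))  # 121.75 362.7 22.34
```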
If we use the Statistics Template, we get the following results from the ANOVA routine.

ANOVA Template for BUSN 5760
Number of Treatment Levels (up to 7) = 3

               1           2           3
n              7           6           5
x-bar       133.4286    122.8333    128.6
s             2.8200      2.9944      2.7019
SSwg i       47.7143     44.8333     29.2000
Total SSwg = 121.75
MSwg = 8.12    (standard error within groups = 2.8489)
x-barT = 128.5556
SSbg        166.2240    196.4630      0.0099
Total SSbg = 362.6968
MSbg = 181.3484
alpha = 1.00%
Fobs = 22.3432
Fc = 6.3588
p-value = 3.17E-05
Decision: Reject Ho

The actual calculations of the means and standard deviations can be performed on the
ANOVA sheet itself: the small 3×10 calculator pad to the right of column H may be used
to calculate the number of observations, the means of the samples, and their standard
deviations for samples of ten or fewer, with cell references then made in the spreadsheet.

We find that SSwg1 = (n1 – 1)s1² = 6(2.82²) = 47.7143. For all three groups, SSwg = 121.75
and MSwg = 121.75/(18 – 3) = 8.12. SSbg = Σni(x̄i – x̄T)² for i = 1, 2, 3 = 362.70, so
MSbg = 362.70/2 = 181.35. The observed F = 181.35/8.12 = 22.34 >> 6.36, the critical
value of F(2, 15, .01). Based on these results, we conclude, at the one percent significance
level, that the means differ.

B. The Chi-square (χ2) test of independence.

The equivalent test for equal proportions, the χ2 test, is generally stated as a test of
independence. Let’s suppose we have groups of individuals that we believe may differ on
the basis of a particular attribute: for example, Republicans versus Democrats on their
opinion of the health care bill, or young people versus old people on whether there should
be cuts in Social Security benefits. The null and alternative hypotheses are as follows:

H0: Attribute A is independent of Group
Ha: Attribute A is dependent on Group
α = α0

From probability theory, independence means that for two events P(A given B) =
P(A); for example, the probability of tossing heads on the second toss of a fair coin
equals the probability of tossing heads (.5) regardless of what happened on the first toss.
Thus, the result A is independent of any earlier outcome B. Let us consider three groups
of registered voters, Republicans, Democrats and Independents and we ask them their
opinion on the health care law. The table below lists the results from the survey.
Opinion       Democrat   Republican   Independent   Total
Favor            250          90           160        500
Against          100         160           240        500
No Opinion        50         100            50        200
Total            400         350           450       1200
Now, if party affiliation did not matter, we should have equal proportions of each party
in favor of, against, or neutral to the legislation. Now we know we will not get exactly the
same proportion because of sampling error, but they should be close enough that one
cannot statistically show a difference. So, for example, 1/3 of the sample (400 of 1200) are
Democrats, so 1/3 of those in favor should be Democrats, that is, 1/3 of 500 = 166.67. We
can do this for each cell of the table. The formula for the expected cell counts is
Eij = (Nrowi × Ncolj)/N. So for the 2,3 cell, Independents against the law, we get
500 × 450/1200 = 187.5.
Notice that once we know E11 and E12, E13 = Erow1 – (E11 + E12), which implies that we lose one
degree of freedom for that row. The same applies to rows 2 and 3. In addition, once I
know E11 and E21, E31 is known. Similarly for columns 2 and 3. In effect, we lose a degree of
freedom for the last row and the last column whenever we use observed cell counts to obtain row
and column totals. Therefore, df = (Nr – 1)(Nc – 1) = (3 – 1)(3 – 1) = 4 in this example. [The
shape of the χ2 4 degrees of freedom distribution is shown below.]
Assuming that all expected cell counts equal or exceed 5 (remember the normal
approximation to the binomial?), we calculate the statistic (Oij – Eij)²/Eij for each cell
and add these up for all nine cells. The resulting statistic is distributed as a
χ2 with four degrees of freedom. We can then compare the observed with the critical value
using the χ2 table. So for α = .05, the critical χ2 is 9.4877. The table below is the output
from the Statistics Template, Chi-square routine, using the data above.
Chi-square Template
Allows one to calculate the Chi-square statistic for an m×n contingency table.
no. of rows (b/w 2 and 6) = 3        alpha = 5.00%
no. of columns (b/w 2 and 5) = 3

Observed counts:
         col1    col2    col3    Total
Row1      250      90     160      500
Row2      100     160     240      500
Row3       50     100      50      200
Total     400     350     450     1200

Expected counts:
Row1   166.6667   145.8333   187.5     500
Row2   166.6667   145.8333   187.5     500
Row3    66.6667    58.3333    75.0     200
Total  400        350        450      1200

(O – E)²/E:
Row1   41.6667   21.3762    4.0333    67.0762
Row2   26.6667    1.3762   14.7000    42.7429
Row3    4.1667   29.7619    8.3333    42.2619
Total  72.5      52.5143   27.0667   152.081

Chi-sq observed = 152.0810
Chi-sq(crit) = 9.4877
p-value = 7.29E-32
Decision: Reject Ho

Our observed value of 152 is much greater than the critical value of 9.49, so we reject
the null hypothesis in favor of the alternative that opinion on this legislation depends on
party affiliation. Thus, at the 5% significance level we conclude that voters’ opinion about
the health care legislation depends on their party registration.
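The χ2 computation can be reproduced from the observed table (a sketch in Python):

```python
observed = [
    [250,  90, 160],
    [100, 160, 240],
    [ 50, 100,  50],
]
row_totals = [sum(r) for r in observed]        # 500, 500, 200
col_totals = [sum(c) for c in zip(*observed)]  # 400, 350, 450
N = sum(row_totals)                            # 1200

# Expected count for each cell is (row total * column total)/N;
# sum (O - E)^2/E over all nine cells
chi_sq = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / N) ** 2
    / (row_totals[i] * col_totals[j] / N)
    for i in range(3) for j in range(3)
)
df = (3 - 1) * (3 - 1)  # 4 degrees of freedom

print(round(chi_sq, 2), chi_sq > 9.4877)  # 152.08 True -> reject independence
```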
Statistics Lecture Notes – Part 3 Multivariate models and analysis
Up to this point all of our analyses have been based on single-variable models.
At times we compared statistics across populations, but there was always a single
variable involved. In the following analyses, we will be trying to determine whether
or not a relationship exists among variables.
I. Correlation and causality: the simple ordinary least squares regression model.
We assume:
a. the model is linear; that is, Yt = a + bxt + et
b. the error terms et are normally distributed with mean 0 and constant variance σ²,
and this holds for all values of x.
c. the error terms et are uncorrelated with the x variable and with each other.
Now we take a random sample of (x, y) observations and use these to estimate the two
coefficients in the model, a and b. The estimated values, â and b̂, will not equal the
true values; but, depending upon how good a fit we get and how large a sample size we use,
the estimates will serve as better or worse indicators of the true values, or parameters.
To understand the rationale and theory behind the process of estimating the parameters
of the model, look at the Cartesian plane below, with axes drawn through the sample
means (x̄, ȳ). Suppose our (x, y) points fall as depicted in the diagram. We see that most
of the points are in quadrants I and III, so that, with the exception of three of the points,
the x values and y values either both exceed or both fall below their respective sample
means. Thus, when x > x̄ we also have y > ȳ, and the products (xi – x̄)(yi – ȳ) are positive
(with three exceptions, but even these are relatively small values). The sum of these
cross-products divided by n – 1 is called the sample covariance between the two variables.
In this case it is positive.

[Figure: scatter of (x, y) points with axes drawn through x̄ and ȳ.]

Had the points lain mostly in quadrants II and IV, then when x > x̄ we would have y < ȳ,
and vice versa, and the covariance would have been negative. So, the direction of the
relationship is determined by the sign of the covariance.
Note that the covariance of a variable x with itself is then Σ(x – x̄)(x – x̄)/(n–1) = s². Now
the size of the covariance is meaningless, as one can enlarge it or reduce it by merely
changing the units of measurement of either or both variables. The way we obtain the
slope is to scale the covariance by the variance of the variable on the horizontal axis; thus
b̂ = cov(x, y)/var(x). Since both of these are calculated by dividing by n – 1, we can also
write it as b̂ = Σ(x – x̄)(y – ȳ)/Σ(x – x̄)².

The corresponding intercept is given as â = ȳ – b̂x̄, because it turns out that the
estimated line of best fit, using the OLS estimation technique to minimize the sum of
squared errors, always passes through the point (x̄, ȳ) [see the proofs for this proof and
for more information on OLS regression], where the estimator for b̂ is the same as
given above: b̂ = Σ(x – x̄)(y – ȳ)/Σ(x – x̄)².
Then the line of best fit is Ŷ = â + b̂X, the estimated error for each observation is ê = Y –
Ŷ, and the sum of squared errors SSe = Σê². This value is the minimum over all choices of â
and b̂. The Statistics Template can be used to calculate the line of best fit. Let us take the
following example. We wish to determine how much training improves performance. We
measure training by the number of hours our employees are given in regular training
sessions and the performance by the number of errors made. A random sample of ten
employees revealed the following results.
Employee Hours trained Daily errors made
1 12 5
2 10 6
3 15 4
4 14 5
5 8 10
6 20 3
7 14 5
8 15 6
9 12 6
10 10 5
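The slope and intercept can be computed from the covariance formulas above (a sketch in Python):

```python
from statistics import mean

hours  = [12, 10, 15, 14, 8, 20, 14, 15, 12, 10]
errors = [ 5,  6,  4,  5, 10,  3,  5,  6,  6,  5]

xbar, ybar = mean(hours), mean(errors)  # 13, 5.5
# Sum of cross-products and sum of squares (the n - 1 divisors cancel)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(hours, errors))  # -43
sxx = sum((x - xbar) ** 2 for x in hours)                          # 104

b_hat = sxy / sxx            # slope ~ -0.4135
a_hat = ybar - b_hat * xbar  # intercept = 10.875

print(round(b_hat, 4), round(a_hat, 3))  # -0.4135 10.875
```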
From the table below we find that the OLS estimated line is given by Ŷ = â + b̂X
= 10.875 – 0.4135X; thus, we predict that a new employee makes about 11 daily errors
and each hour of training reduces the number of errors by .4135 per day.
We now need some sort of statistical test to see if the training has a statistically
significant effect on the number of errors made each day by the employees. To do this we
need first to ask the following questions: “If we did not use our model and the estimate
from OLS, what would we do instead?” Since we were minimizing the sum of squares,
we would want to compare our model’s performance with another technique that used
the same criterion. Also, we would presumably not have another variable to help make
our predictions; in effect, we would assume that each employee’s performance did not
depend on the number of hours of training, but was the same for all employees. In this
case, the number of errors was just random and unpredictable around the mean. Since
we don’t know µY, we use our best guess Ȳ and calculate s² = Σ(Y – Ȳ)²/(n–1). The sum
Σ(Y – Ȳ)² is SST, or total sum of squares.

Now we know what value minimizes SST for a set of data; it’s the sample mean. So we
have two ways of predicting Y: one uses the x values, the other does not. But our
regression model could have chosen to make b̂ = 0; then minimizing squares would have
made Ŷ = Ȳ. In that case SSe = SST and the improvement from using the x-variable in
reducing the squared errors would be zero.
At the other extreme, if all the points lay on a sloped line, the reduction in errors would
equal the SST and the SSe would be zero. In this case we would have explained 100% of the
variability in Y. Now the measure we use answers the question: “What fraction of the
variance in Y is explained by the one variable we used in our regression?” It is called the
coefficient of determination, or more commonly R2 and is calculated as follows. [Look at
the printout above]
Simple regression Template for BUSN 5760
Number of observations (up to 20) = 10

Obs    X    Y   X-Xbar  Y-Ybar     xy    x^2    Y-hat     e-hat   e-hat^2     y^2
  1   12    5     -1     -0.5     0.5      1   5.9135   -0.9135    0.8344    0.25
  2   10    6     -3      0.5    -1.5      9   6.7404   -0.7404    0.5482    0.25
  3   15    4      2     -1.5    -3.0      4   4.6731   -0.6731    0.4530    2.25
  4   14    5      1     -0.5    -0.5      1   5.0865   -0.0865    0.0075    0.25
  5    8   10     -5      4.5   -22.5     25   7.5673    2.4327    5.9180   20.25
  6   20    3      7     -2.5   -17.5     49   2.6058    0.3942    0.1554    6.25
  7   14    5      1     -0.5    -0.5      1   5.0865   -0.0865    0.0075    0.25
  8   15    6      2      0.5     1.0      4   4.6731    1.3269    1.7607    0.25
  9   12    6     -1      0.5    -0.5      1   5.9135    0.0865    0.0075    0.25
 10   10    5     -3     -0.5     1.5      9   6.7404   -1.7404    3.0289    0.25
Mean  13  5.5   Sums:     0      -43    104      55      0.0000   12.7212   30.50
                                                                  (SSE)      (SST)

b-hat = -0.413462      R^2 = 0.582913
a-hat = 10.875         r = -0.763487
Y-hat = 10.875 - 0.413462 X
se = 1.261009          F(obs) = 11.18065
sb-hat = 0.123652      F(prob) = 0.010174
t(obs) = -3.343748
alpha = 2.50%          tc (2-tail) = 2.751531    tc (1-tail) = 2.306006
t-prob (2-tail) = 0.010174    t-prob (1-tail) = 0.005087

[Chart: the OLS regression line plotted through the scatter of (X, Y) points.]
The improvement is given by SSR = SST – SSE (we changed the notation so that SSE =
SSe, etc.): from SST = Σ(Y – Ȳ)² subtract SSE = Σ(Y – Ŷ)², where the Ŷ are the values
predicted from the OLS model. We can now partition the SST into the SSE, which is still
unexplained, and SSR, which is the explained portion of SST. The R² = SSR/SST. In the
regression above, R² = (30.5 – 12.72)/30.5 = .583. Now the square root of this statistic,
bearing the sign of the slope (in this case negative), is called the sample correlation
coefficient rxy. So robs = –√.583 = –.7635. We can now test for the significance of this
statistic; it is distributed as a t-statistic with n – 2 degrees of freedom.
Another way to test for the usefulness of the model is to test the R2 directly. This test is
based on our partition of SST into explained and unexplained variances. The unexplained
error SSE is analogous to the SSwg in ANOVA and SSR is analogous to the SSbg. The
degrees of freedom of SST is n – 1, as we lose one degree of freedom because we use Ȳ
rather than μY in our sample variance. And we lose one more degree of freedom because
we use a variable x to estimate the variance in the regression model. Another way to think
of it is that we estimated two parameters, a and b, in the linear model using our data, so we
lose two degrees of freedom. The best way to understand the concept, however, is to realize
that with a straight line we can always get a perfect fit for any two points as two points
determine a straight line. Thus, we have no degrees of freedom permitting us to be off the
line.
So the SSR/1 is the average amount of variation explained by the one variable, x, in the
linear model. We lose two degrees of freedom, leaving us with n – 2 degrees of freedom for
the SSE. Then SSR/[SSE/(n–2)] ~ F1,n–2. As it turns out, an F1,k is just the square of a t
with k degrees of freedom. So the t must be the square root of the F, except that the t also
bears a sign, in this case the sign of b̂. We can also calculate the F as follows:
Fobs = R²/[(1 – R²)/(n–2)]. (Why?) Now the t-test, which is often more useful than the
corresponding F-test, since the F cannot test for the direction of the relationship, is
conducted as follows.
H0: ρ ≥ 0
Ha: ρ < 0
α = α0
where ρ is the population correlation coefficient. An equivalent test, which we do not
discuss here, is the test on the slope coefficient b directly. These two tests are totally
equivalent in the simple OLS model as the only source of correlation is through the one
independent variable in our model; a non-zero slope means we have correlation, a positive
slope means positive correlation, and a negative slope means negative correlation. In our
example, we expect more training to reduce errors, as we conduct a one-tail, less-than test.
From the printout above, we see that the observed t-value is -3.344 and the one-tailed
critical t-value for α = .025 is (-)2.306. Thus, at the 2.5% significance level we reject H0 and
conclude that training time and errors are negatively correlated.
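The correlation t-test can be checked numerically (a sketch; Python assumed, with R² taken from the printout):

```python
import math

n = 10
r2 = 0.582913       # R^2 from the regression printout
r = -math.sqrt(r2)  # correlation carries the sign of the slope

# t-statistic for H0: rho >= 0 against Ha: rho < 0, with n - 2 df
t_obs = r * math.sqrt((n - 2) / (1 - r2))
t_crit = -2.306     # one-tailed, alpha = .025, 8 df

print(round(t_obs, 3), t_obs < t_crit)  # -3.344 True -> reject H0
```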
IV. The OLS multiple regression model.
Extending the simple OLS regression model to more than one explanatory variable is
straightforward. The model is as follows:
Y = b0 + b1X1 + b2X2 + … + bkXk + e, where

e ~ N(0, σe), e is uncorrelated with any of the explanatory variables Xi or with the other
e’s, and the error variance is the same for all values of the independent variables X.

OLS regression then estimates the bi, i = 0, 1, …, k, so as to minimize SSE = Σ(Y – Ŷ)²,
where Ŷ = b̂0 + b̂1X1 + b̂2X2 + … + b̂kXk is the vector of predicted Y’s. The R² is
calculated as before: SST = Σ(Y – Ȳ)² and SSR = SST – SSE is the explained variation.
Then Fobs = (R²/k)/[(1 – R²)/(n – (k+1))] ~ Fk, n–(k+1). The test for the overall usefulness
is an F-test and the hypotheses are:
H0: bi = 0 for all i = 1, 2, … , k
Ha: bi ≠ 0 for some i = 1, 2, … , k
α = α0
To reject H0, Fobs > Fcrit = F(k, n-(k+1), α0)
If we reject H0 we should next test for which variables are explanatory; this is a series of
t-tests of the k bi coefficients. Each is of the form
H0: bi = 0 H0: bi ≥ 0 H0: bi ≤ 0
Ha: bi ≠ 0 Ha: bi < 0 Ha: bi > 0
and α = α0
While there is no multiple regression procedure in the Statistics Template, it is very easy
to use Excel to perform multiple regressions. Make sure that the columns of explanatory
variables are contiguous and then use the Analysis ToolPak under the Tools
tab (if this is not already loaded onto your computer, use Add-Ins, also under Tools, to load
it). [For Excel 2007 and later, click the icon at the top left, go to the Excel
Options at the bottom, select Add-Ins on the left, click Go at the bottom, and select
Analysis ToolPak.] Then highlight the dependent Y variable and the X variables and let
‘er rip.
Example: Suppose we also have data on the ages of the ten employees that we trained in
our earlier example. The table below lists the ages of the employees along with the training
hours and the number of daily errors made after the training.
Errors TrainHrs Age
5 12 49
6 10 30
4 15 30
5 14 29
10 8 25
3 20 40
5 14 54
6 15 32
6 12 35
5 10 42
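These results can be reproduced outside Excel as well (a sketch assuming Python with numpy available):

```python
import numpy as np

errors = [5, 6, 4, 5, 10, 3, 5, 6, 6, 5]
train  = [12, 10, 15, 14, 8, 20, 14, 15, 12, 10]
age    = [49, 30, 30, 29, 25, 40, 54, 32, 35, 42]

# Design matrix with an intercept column; least squares minimizes ||Xb - y||^2
X = np.column_stack([np.ones(len(errors)), train, age])
b, *_ = np.linalg.lstsq(X, np.array(errors, dtype=float), rcond=None)

# Coefficients: intercept ~ 12.589, TrainHrs ~ -0.378, Age ~ -0.059
print([round(c, 3) for c in b])
```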
We regressed the first column on the second and third and got the results shown below.
The two independent variables were able to explain about 2/3 of the variation in Errors.
The observed F value was 7.148 and the degrees of freedom were 2 in the numerator (the
number of explanatory variables) and 7 in the denominator (10 – (2 + 1)). From the F-
table with α = .05, the critical F is 4.737; that is, if the true slopes for all the variables were
zero, we would obtain an F value exceeding 4.737 less than 5% of the time. Therefore, we
reject the null hypothesis and conclude the model is useful.
As for which of the variables have an effect on Errors, we see that the coefficient for
TrainHrs remains about the same as before, -.378, and the t = -3.146 is still rather large (in
absolute value). The critical t-value is –t(7, .05) = –1.895 (using a less-than test at a .05
significance level). Thus, we conclude, at the 5% significance level, that the number of
hours of training had a negative impact on the number of daily errors made.
As for Age, the data suggest that older employees make fewer errors, but the observed t-
statistic of -1.372 is not significant at the 5% significance level (assuming a less-than test),
so we cannot conclude that age matters much. Thus, at the 5% significance level, we
cannot conclude that the age of the employee affected the number of errors made each
day.
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.819327
R Square              0.671297
Adjusted R Square     0.577382
Standard Error        1.196747
Observations         10

ANOVA
                df        SS        MS         F   Significance F
Regression       2  20.47457  10.23728  7.147921         0.020362
Residual         7  10.02543  1.432204
Total            9  30.5

           Coefficients  Standard Error     t Stat   P-value  Lower 95%  Upper 95%
Intercept      12.58934        2.00798   6.269653  0.000416   7.841223   17.33745
TrainHrs      -0.378037        0.120158  -3.146165  0.016239  -0.662165  -0.093909
Age           -0.059422        0.043313  -1.37194   0.212432  -0.161841   0.042996
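For readers who want to check the Excel output above by hand (or without Excel), the sketch below recomputes the same regression using Python with NumPy. This is not part of the Statistics Template; it is just one way to verify the coefficients, R², F, and t statistics from the data table.

```python
import numpy as np

# Data from the training example: daily errors, hours of training, and age
errors = np.array([5, 6, 4, 5, 10, 3, 5, 6, 6, 5], dtype=float)
train  = np.array([12, 10, 15, 14, 8, 20, 14, 15, 12, 10], dtype=float)
age    = np.array([49, 30, 30, 29, 25, 40, 54, 32, 35, 42], dtype=float)

n, k = len(errors), 2
X = np.column_stack([np.ones(n), train, age])  # design matrix with intercept column

# OLS: choose b to minimize SSE = sum((Y - Xb)^2)
b, *_ = np.linalg.lstsq(X, errors, rcond=None)

resid = errors - X @ b
sse = resid @ resid                            # unexplained variation
sst = np.sum((errors - errors.mean()) ** 2)    # total variation
r2 = 1 - sse / sst

# Overall-usefulness F statistic
f_obs = (r2 / k) / ((1 - r2) / (n - (k + 1)))

# Standard errors and t statistics for the individual coefficients
mse = sse / (n - (k + 1))                      # Excel's Residual MS
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
t = b / se

print(np.round(b, 3))    # intercept ~ 12.589, TrainHrs ~ -0.378, Age ~ -0.059
print(round(r2, 3))      # ~ 0.671
print(round(f_obs, 2))   # ~ 7.15
print(np.round(t, 3))    # TrainHrs t ~ -3.146, Age t ~ -1.372
```

The least-squares step could also be written out explicitly as b = (XᵀX)⁻¹XᵀY; `lstsq` is simply a numerically safer way of solving the same normal equations.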
Index of Terms Used in this Text
analysis of variance technique, ANOVA ................................................................................... 23
binomial distribution .................................................................................................................... 9
central limit theorem .................................................................................................................. 14
Chebychev ...................................................................................................................................... 4
Chi-square (χ2) test of independence ......................................................................................... 23
continuous distribution ................................................................................................................. 4
correlation .................................................................................................................................... 28
covariance .................................................................................................................................... 28
discrete distribution ...................................................................................................................... 8
distribution .................................................................................................................................... 4
expected value................................................................................................................................ 4
exponential distribution ............................................................................................................. 12
hypothesis test.............................................................................................................................. 15
inferential statistics ....................................................................................................................... 14
mean ............................................................................................................................................... 4
median ............................................................................................................................................ 4
mode ............................................................................................................................................... 4
multivariate models .................................................................................................................... 28
normal distribution ....................................................................................................................... 7
parameter....................................................................................................................................... 4
Poisson Distribution ...................................................................................................................... 8
probability density function (pdf)................................................................................................ 7
probability distribution ................................................................................................................ 4
qualitative data .............................................................................................................................. 4
random sampling .......................................................................................................................... 4
random variable ............................................................................................................................ 4
random variate .............................................................................................................................. 4
range ............................................................................................................................................... 4
significance level .......................................................................................................................... 15
simple ordinary least squares regression model ...................................................................... 28
standard deviation ........................................................................................................................ 4
uniform distribution ..................................................................................................................... 6
variance .......................................................................................................................................... 4
z-scores ........................................................................................................................................... 7