7/30/2019 Univariate Stats
Appendix A Introduction to Probability and Statistics
A.1 Introduction
If you've heard this story before, don't stop me, because I'd like to hear it again. (Groucho Marx)
Many questions have probabilistic solutions. For example, are two objects/motifs similar
because of some relationship, or is it coincidence? How do you classify similar items? How
surprising is the occurrence of an unusual event? How do we describe probabilistic outcomes
using models? What do we do when an exact model is not available? Which hypothesis/model
is more relevant/accurate/believable?
Before we talk models, we need to know about discrete and continuous random variables,
and the testing of hypotheses. Random experiments must satisfy the following:
1. All possible discrete outcomes must be known in advance.
2. The outcome of a particular trial is not known in advance, and
3. The experiment can be repeated under identical conditions.
A sample space S is associated with a random experiment: it is the set of all outcomes.
Example A.1 S for a coin toss is {H, T}, and for a die roll it is {1, 2, 3, 4, 5, 6}. An event A is a subset of S, and we are usually interested in whether A happens or not, e.g. A = odd value on roll of a die; A = H on tossing a coin.
The random variable X is some function mapping outcomes in space S to the real line, $X : S \to \mathbb{R}$. E.g. with two coin tosses, S = {(H, T), (T, H), (H, H), (T, T)}, X = number of heads turning up, X(H, T) = 1, X(H, H) = 2 etc.
Random variable: A real valued function that assigns a number to each outcome in a sample space; probabilities are then attached to the values it takes. This function could have discrete or continuous values. For example,
1. The sum of two rolls of a die ranges from 2 to 12.
2. The sequence HHTH of coin tosses is not a random variable. It must have a numerical mapping.
[Margin note: Such variables are i.i.d.: independent and identically distributed.]
A function of a random variable defines another random variable. Measures such as mean and variance can be associated with a random variable. A random variable can be conditioned on an event, or on another random variable. The notion of independence is important: the random variable may be independent from an event or from another random variable.
[Margin note: Random variable = no control over outcome.]
Statistics refers to drawing inferences about specific phenomena (usually a random phenomenon) on the basis of a limited sample size. We are interested in descriptive statistics: we need measures of location and measures of spread.
Data scales are classified as categorical, ordinal or cardinal. Categorical scale: A label per
individual (win/lose etc.). If the label = name, we refer to the data as nominal data. Ordinal
data can be ranked, but no arithmetic transformation is meaningful. e.g. vision (= 20-20,
20-30), grades etc. Cardinal data is available in interval or ratio scales. Ratio scale: we know the zero point (e.g. a ratio of absolute temperatures is meaningful in K and not in °C; the same holds for mass flow ratios). Interval scale (arbitrary zero): °F, °C.
[Margin note: We mostly use cardinal data.]
Cardinal data may be sorted and ranked: info may be lost in the process, but analysis may be easier. Consider the following transformation: If 4 individuals A, B, C & D have heights (in inches) 72, 60, 68 and 76 respectively, they may be transformed to ranks: 2, 4, 3 & 1. The height difference between B & A (12 inches) ≠ the difference between B and C (8 inches), yet on the rank scale these differences become 2 and 1, and therefore transformations may lead to incorrect inferences. Ordinal data are protected by monotonic transforms: If x > y, then log(x) > log(y); however |x − y| ≠ |log(x) − log(y)| in general.
© SBN, IITB, 2012
A.2 Discrete random variables
Measures of location
[Fig. A.1: Translating variables on a linear scale. Axes: x vs. density; curve I: $y_i = x_i$, curve II: $y_i = x_i + c$.]
Mean
The sample mean is $\bar{x}$ and the population mean is $\mu$.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{A.1}$$
where N = size of total population and is usually unknown. The arithmetic mean can be translated or rescaled.
Translation: If c is added to every $x_i$, i.e. $y_i = x_i + c$, then $\bar{y} = \bar{x} + c$ (Fig. A.1).
Scaling: If $y_i = c x_i$ then $\bar{y} = c\bar{x}$.
Median
For n observations, the sample median is
(n + 1)/2th largest observation for odd n,
average of n/2th and (n/2) + 1th largest observations for even n.
Mode
Most frequently occurring value. Can be problematic if each xi is unique.
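Geometric mean
$\bar{x}_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$ and, on a log scale, $\bar{x}_G = \mathrm{antilog}\left[\left(\sum_{i=1}^{n} \log x_i\right)/n\right]$.
Harmonic mean
$1/\bar{x}_H = (1/n)\sum_{i=1}^{n}(1/x_i)$. For positive data, $\bar{x}_H \le \bar{x}_G \le \bar{x}$, with equality only when all the $x_i$ are equal.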
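The three means defined above can be compared on a small data set. The sketch below reuses the four heights that appear earlier in this appendix; the function names are illustrative choices, not part of the notes.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # antilog of the mean of the logs; requires strictly positive data
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    # reciprocal of the mean of the reciprocals; requires nonzero data
    return len(xs) / sum(1.0 / x for x in xs)

heights = [72, 60, 68, 76]  # inches
h = harmonic_mean(heights)
g = geometric_mean(heights)
a = arithmetic_mean(heights)
print(h, g, a)  # harmonic <= geometric <= arithmetic for positive data
```

Working on the log scale avoids overflow in the product when n is large.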
Display of data
Pareto diagram
Dot diagram
[Fig. A.2: The dot diagram. Axes: value vs. frequency.]
Given the data {3, 6, −2, 4, 7, 4, 3}, the dot diagram is shown in Figure A.2. This simple plot can be useful if there exists lots of data. From a visual inspection of the plot, −2 could be considered an outlier. Another advantage of these plots is that they can be superimposed.
Frequency Distributions
[Fig. A.3: A frequency distribution. Axes: x vs. frequency.]
A frequency distribution is shown as a histogram in Figure A.3. Frequencies for each class/category are tallied (sometimes using the # notation). Endpoints can be conveniently represented using bracket notation: in [1, 10), 1 is included and 10 is not. In [10, 19), 10 is included when tallying. Grouping together data loses information (in the form of specific individual values). Cumulative distributions can also be generated from the frequency curves. If the height of a rectangle in a histogram is equal to the relative frequency/width ratio, then total area = 1 ⇒ we have a density histogram.
Stem and leaf displays
Stem | Leaves
1 | 2 7 5
2 | 9 1 5 4 8
3 | 4 9 2 4
4 | 4 8 2
5 | 3
i.e.
12 17 15
29 21 25 24 28
34 39 32 34
44 48 42
53
Measures of spread
Range
The range is the difference between largest and smallest observations in a sample set. It is
easy to compute and is sensitive to outliers.
Quantiles/percentiles
The pth percentile is that threshold such that p% of observations are at or below this value. To find the pth percentile:
(1) order the n observations (from small to large),
(2) determine np/100,
(3) if np/100 is not an integer, round up: take the (k + 1)th value in sorted order, where k = largest integer less than np/100,
(4) if np/100 is an integer k, average the kth and (k + 1)th values.
For the 10th percentile, p/100 = 0.1.
The median = 50th percentile. The layout of quantities which calculation of percentiles requires gives information about spread/shape of distribution. However data must be sorted.
Let Q1 = 25th percentile, Q2 = 50th percentile = median, and Q3 = 75th percentile.
[Fig. A.4: A box plot, marking Q1, Q2 and Q3.]
The range = (max value − min value) is strongly influenced by outliers. The interquartile range = Q3 − Q1 spans the middle half of the data and is less affected by outliers. The boxplot (Fig. A.4) is a graphical depiction of these ranges. A box is drawn from Q1 to Q3, the median is drawn as a vertical line in the box, and outer lines are drawn either up to the outermost points, or to 1.5 × (box width) = 1.5 × (interquartile range).
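The percentile rule (sort, compute np/100, then either round up or average two neighbours) can be sketched as follows. The data are the sixteen values of the stem and leaf display above; the function name and the restriction to integer p strictly between 0 and 100 are assumptions made to keep the example short.

```python
def percentile(data, p):
    """pth percentile by the sort-and-count rule; p is an integer, 0 < p < 100."""
    xs = sorted(data)                    # order from small to large
    n = len(xs)
    k, remainder = divmod(n * p, 100)    # np/100 = k + remainder/100
    if remainder != 0:
        return xs[k]                     # not an integer: (k+1)th value (0-based index k)
    return (xs[k - 1] + xs[k]) / 2       # integer k: average the kth and (k+1)th values

data = [12, 17, 15, 29, 21, 25, 24, 28, 34, 39, 32, 34, 44, 48, 42, 53]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Integer arithmetic with `divmod` sidesteps floating point doubts about whether np/100 is "really" an integer.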
Deviation
The mean of the deviations from the mean is $d = \sum_{i=1}^{n}(x_i - \bar{x})/n = 0$ and is not useful. The mean deviation is $\sum_{i=1}^{n}|x_i - \bar{x}|/n$. This metric does measure spread, but does not accurately reflect a bell shaped distribution.
Variance and Standard deviation
A variance may be defined as
$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \tag{A.2}$$
where N denotes the size of a large population. The sample variance $S^2$ uses (n − 1) instead of n in the denominator. (There are only (n − 1) independent deviations if $\bar{x}$ is given, with their sum being constrained to be 0.) This definition of $S^2$ with (n − 1) is used to better represent the population variance $\sigma^2$. A large $S^2$ implies large variability. $S^2$ is always ≥ 0.
The definition of $S^2$ above requires that $\bar{x}$ be computed first and then $S^2$; two passes of the data are required. A more convenient form (given below) permits simultaneous computation of $\sum x_i^2$ and $\sum x_i$, and thus requires one pass.
[Fig. A.5: Effect of scaling on variance. Axes: x vs. density; curves $y_i = x_i$ and $y_i = c x_i$.]
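$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2/n}{n-1} \tag{A.3}$$
For translated data, $y_i = x_i + c$ (i = 1 to n) ⇒ $\bar{y} = \bar{x} + c$ and $S_y^2 = S_x^2$. For scaled samples, $y_i = c x_i$ for i = 1, ..., n, c > 0 ⇒ $S_y^2 = c^2 S_x^2$ (Fig. A.5).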
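A minimal sketch of the two-pass definition (A.2) next to the one-pass form (A.3), together with a check of the translation and scaling rules. The function names are illustrative. One caveat that is not in the notes: the one-pass form subtracts two large, nearly equal numbers, so it can lose precision in floating point when the mean is large relative to the spread.

```python
def variance_two_pass(xs):
    n = len(xs)
    mean = sum(xs) / n                                  # first pass: the mean
    return sum((x - mean) ** 2 for x in xs) / (n - 1)   # second pass: squared deviations

def variance_one_pass(xs):
    n, total, total_sq = 0, 0.0, 0.0
    for x in xs:                                        # single pass: accumulate both sums
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

heights = [72, 60, 68, 76]
s2 = variance_two_pass(heights)
s2_one_pass = variance_one_pass(heights)
s2_shifted = variance_two_pass([x + 10 for x in heights])   # translation by c = 10
s2_scaled = variance_two_pass([3 * x for x in heights])     # scaling by c = 3
print(s2, s2_one_pass, s2_shifted, s2_scaled)
```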
Coefficient of variation
$$CV = 100\% \times \frac{S}{\bar{x}} \tag{A.4}$$
This metric is dimensionless, hence one can discuss S relative to the mean's magnitude.
This is useful when comparing different sets of samples with different means, with larger
means usually having higher variability.
Indices of diversity
Such indices are used for nominal scaled data where mean/median do not make sense, e.g. to describe diversity among groups. Uncertainty is a measure of diversity. For k = number of categories and $p_i = f_i/n$ = proportion of observations in category i,
$$H = -\sum_{i=1}^{k} p_i \log p_i = \frac{n \log n - \sum_{i=1}^{k} f_i \log f_i}{n} \tag{A.5}$$
$H_{max} = \log k$. Shannon's index $J = H/H_{max}$.
[Margin note: More on uncertainty later. For now, notice the connection to entropy!]
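The two forms of H in (A.5) can be checked against each other numerically. This is a small sketch with made-up category counts; the function name and the counts are hypothetical.

```python
import math

def diversity(freqs):
    """Return (H from proportions, H from counts, J = H / Hmax) for category counts."""
    n = sum(freqs)
    h_from_p = -sum((f / n) * math.log(f / n) for f in freqs if f > 0)
    h_from_f = (n * math.log(n) - sum(f * math.log(f) for f in freqs if f > 0)) / n
    h_max = math.log(len(freqs))        # attained when all categories are equally frequent
    return h_from_p, h_from_f, h_from_p / h_max

h_p, h_f, j = diversity([5, 3, 2])      # hypothetical counts in k = 3 categories
h_even = diversity([4, 4, 4])           # perfectly even counts
print(h_p, h_f, j, h_even[2])
```

Empty categories are skipped in the sums, following the usual convention that 0 log 0 = 0.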
Probability mass function (probability distribution)
The assignment of a probability to the value r that a random variable X takes is represented as $Pr_X(X = r)$. E.g. two coin tosses, X = number of heads:
$$Pr_X(x) = \begin{cases} 1/2 & \text{if } x = 1 \\ 1/4 & \text{if } x = 0 \text{ or } x = 2 \\ 0 & \text{otherwise} \end{cases}$$
Obviously $0 \le Pr(X = r) \le 1$ and $\sum_{r \in S} P(X = r) = 1$.
[Margin note: Notation: we will use P(X = r) instead of $Pr_X(X = r)$.]
A probability distribution is a model based on infinite sampling, identifying the fraction of outcomes/observations in a sample that should be allocated to a specific value. Consider a series of 10 coin tosses where the outcome = 4 heads. A probability distribution can tell us how often 4 heads crop up when the 10 coin toss experiment is repeated 1000 times.
To establish the probability that a drug works on {0, 1, 2, 3, 4} out of 4 patients treated, theoretical calculations (from some distribution) are compared with some frequency counts generated from 100 samples.
r    P(X = r)    Frequency distribution (100 samples)
0    0.008       0.000 [0/100]
1    0.076       0.090 [9/100]
2    0.265       0.240 [24/100]
3    0.411       0.480 [48/100]
4    0.240       0.190 [19/100]
Compare the two distributions. Does clinical practice match expectations from experience? Given data and some understanding of the process, you can apply one out of many known PMFs to generate the middle column above.
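As an illustration of generating the middle column from a known PMF: the tabulated values agree, to about three decimals, with a binomial distribution with n = 4 and p = 0.7. The notes only say "some distribution", so the choice of a binomial with p = 0.7 is an inference made here, not something the text states.

```python
from math import comb

def binomial_pmf(r, n, p):
    # probability of exactly r successes in n independent trials
    return comb(n, r) * p**r * (1 - p)**(n - r)

tabulated = [0.008, 0.076, 0.265, 0.411, 0.240]       # middle column of the table
model = [binomial_pmf(r, 4, 0.7) for r in range(5)]   # assumed Binomial(4, 0.7)
expected_cures = sum(r * pr for r, pr in enumerate(model))

for r, (t, m) in enumerate(zip(tabulated, model)):
    print(r, t, round(m, 4))
print("E[X] =", expected_cures)
```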
Expectation of a discrete random variable
The expectation of X is denoted
$$E[X] = \sum_{i=1}^{R} x_i P(x_i) \tag{A.6}$$
where $x_i$ = value i of the random variable, e.g. the probability that a drug works on {0, 1, 2, 3, 4} out of four patients:
r      P(X = r)
0      0.008
1      0.076
2      0.265
3      0.411
4      0.240
Sum    1.000
E[X] = 0(0.008) + 1(0.076) + 2(0.265) + 3(0.411) + 4(0.240) = 2.80. Therefore 2.8 out of 4 patients would be cured, on average.
For the case of two coin tosses, if X = number of heads, then E[X] = 0(1/4) + 1(1/2) + 2(1/4) = 1. If the probability of a head on a coin toss was 3/4, X is a binomial random variable with n = 2, p = 3/4:
$$P(X = k) = \begin{cases} (1/4)^2 = 1/16 & k = 0 \\ 2 \cdot (1/4) \cdot (3/4) = 6/16 & k = 1 \\ (3/4)^2 = 9/16 & k = 2 \end{cases}$$
$$E[X] = 0 \cdot (1/4)^2 + 1 \cdot (2 \cdot 1/4 \cdot 3/4) + 2 \cdot (3/4)^2 = 24/16 = 1.5$$
Example A.2 Consider a game where 2 questions are to be answered. You have to decide which question to answer first. Question 1 may be correctly answered with probability 0.8, in which case you win Rs. 100. Question 2 may be correctly answered with probability 0.5, in which case you win Rs. 200. If the first question is incorrectly answered, the game ends. What should you do to maximize your prize money expectation?
[Margin note: Expectations can be compared.]
Soln:
Notice a tradeoff: answering the more valuable question first runs the risk of not getting to answer the other question. Let the total prize money = X = random variable.
If Q1 is attempted first, the PMF of X is $p_X(0) = 0.2$, $p_X(100) = 0.8 \times 0.5 = 0.4$, $p_X(300) = 0.8 \times 0.5 = 0.4$. Therefore E[X] = 0 × 0.2 + 100 × 0.8 × 0.5 + 300 × 0.8 × 0.5 = Rs. 160.
If Q2 is attempted first, the PMF of X is $p_X(0) = 0.5$, $p_X(200) = 0.5 \times 0.2 = 0.1$, $p_X(300) = 0.5 \times 0.8 = 0.4$, hence E[X] = 0 × 0.5 + 200 × 0.5 × 0.2 + 300 × 0.5 × 0.8 = Rs. 140. Hence attempt question 1 first.
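The comparison in Example A.2 can be reproduced by building the PMF of the prize for each ordering. This is a sketch; the function name and the (probability, prize) tuple layout are choices made here for illustration.

```python
def expected_prize(first, second):
    """Each question is a (probability of a correct answer, prize) pair."""
    p1, w1 = first
    p2, w2 = second
    pmf = {
        0: 1 - p1,                 # first answer wrong: game ends with nothing
        w1: p1 * (1 - p2),         # first right, second wrong
        w1 + w2: p1 * p2,          # both right
    }
    return sum(x * px for x, px in pmf.items())

q1 = (0.8, 100)
q2 = (0.5, 200)
e_q1_first = expected_prize(q1, q2)
e_q2_first = expected_prize(q2, q1)
print(e_q1_first, e_q2_first)
```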
Example A.3 The weather is good with a probability of 0.6. You walk 2 km to class at a speed V = 5 kph or you drive 2 km to class at a speed V = 30 kph. If the weather is good you walk. What is the mean of the time T to get to class? What is the mean of velocity V?
[Margin note: Expectation comparisons can be misleading! T = 2/V. However $E[T] = E[2/V] \ne 2/E[V]$.]
Soln:
PMF of T: $P_T(t)$ = 0.6 if t = 2/5 hrs, and 0.4 if t = 2/30 hrs.
Therefore E[T] = 0.6 × 2/5 + 0.4 × 2/30 = 4/15 hrs. E[V] = 0.6 × 5 + 0.4 × 30 = 15 kph. Note that E[T] = 4/15 ≠ 2 km/E[V] = 2/15.
Variance of a discrete random variable
The variance of a discrete random variable is defined as
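$$\mathrm{Var}[X] = \sigma_X^2 = \sum_{i=1}^{R}(x_i - \mu)^2 P(X = x_i) \tag{A.7}$$
The standard deviation $SD(X) = \sqrt{\mathrm{Var}[X]} = \sigma$ has the same units as the mean (and X). Using the E[·] notation,
$$\mathrm{Var}[X] = \sigma_X^2 = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2 = E[X^2] - \mu^2 \tag{A.8}$$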
Moments
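$\mu'_k$ denotes the kth moment about the origin
$$\mu'_k = \sum_i x_i^k P_X(x_i) \tag{A.9}$$
Hence the mean = $\mu = E[X] = \sum_i x_i P(x_i) = \mu'_1$ = 1st moment about origin. $\mu_k$ is the kth moment about the mean.
$$\mu_k = \sum_i (x_i - \mu)^k P_X(x_i) \tag{A.10}$$
Variance = 2nd moment about mean = $\mu_2 = \sigma^2$. Then $\mu_2 = \mu'_2 - \mu^2$. $\mu_3/\sigma^3$ = skewness (degree of asymmetry).
$$\frac{\mu_3}{\sigma^3} = \frac{E\left[(X - E[X])^3\right]}{\sigma^3} \tag{A.11}$$
Skew to right (+ve skew) or left (−ve skew). The mean is on the same side of the mode as the longer tail. An alternate measure of skew is (mean − mode)/σ, but to avoid using the mode, use skewness = 3(mean − median)/σ.
Kurtosis (peakedness) is
$$\frac{\mu_4}{\sigma^4} = \frac{E\left[(X - E[X])^4\right]}{\sigma^4} \tag{A.12}$$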
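A moment generating function of X is defined as
$$M(t) = M_X(t) = E\left[e^{tX}\right] = E\left[1 + tX + \frac{t^2X^2}{2!} + \cdots\right] \tag{A.13}$$
[Margin note: If t is replaced with −s you have the Laplace transform, and if replaced with it, the Fourier transform. It's a small world...]
Then $\mu'_n = E[X^n]$ is the nth moment of X and is equal to $M^{(n)}(0)$, the nth derivative of M evaluated at t = 0 (from a comparison with a Taylor's expansion of M(t)).
$$M(t) = 1 + \mu'_1 t + \frac{\mu'_2 t^2}{2!} + \cdots + \frac{\mu'_n t^n}{n!} + \cdots$$
If $y = \sum_{i=1}^{n} x_i$ with the $x_i$ independent, then
$$M_y(t) = E\left[e^{ty}\right] = E\left[e^{tx_1}e^{tx_2}\cdots e^{tx_n}\right] = E\left[e^{tx_1}\right]E\left[e^{tx_2}\right]\cdots E\left[e^{tx_n}\right] = M_{x_1}(t)M_{x_2}(t)\cdots M_{x_n}(t) = \prod_{i=1}^{n} M_{x_i}(t) \tag{A.14}$$
[Margin note: Like a Binomial (several coin tosses) from several iid Bernoullis.]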
Chebyshev's theorem
If a probability distribution has mean μ and standard deviation σ, the probability of getting a value deviating from μ by at least kσ is at most $1/k^2$. We can talk in terms of kσ, for example 2σ, 3σ etc.
$$P(|X - E[X]| \ge k\sigma) \le \frac{1}{k^2} \tag{A.15}$$
This holds for any probability distribution!
[Fig. A.6: Proof of Chebyshev's theorem. Axes: x vs. density; regions R1, R2 and R3 separated at −Kσ and Kσ.]
Proof:
$$\sigma_X^2 = \sum_{i=1}^{R}(x_i - \mu)^2 P(X = x_i)$$
Consider three regions R1, R2 and R3 (see Fig. A.6) where R1: $x \le \mu - k\sigma$, R2: $\mu - k\sigma <$ [...]
[...] $> 1\}$. Then if X = 1, E[X | X = 1] = 1 (we have success on the first try). If the first try fails (X > 1), one try is wasted, but we are back where we started: the expected number of remaining tries is still E[X], and therefore E[X | X > 1] = 1 + E[X].
$$E[X] = P(X = 1)E[X|X = 1] + P(X > 1)E[X|X > 1] = p + (1 - p)(1 + E[X]) \;\Rightarrow\; E[X] = \frac{1}{p} \quad \text{(on rearranging)}$$
Similarly $E[X^2 | X = 1] = 1$, and $E[X^2 | X > 1] = E\left[(1 + X)^2\right] = 1 + 2E[X] + E[X^2]$.
Therefore,
$$E[X^2] = p \cdot 1 + (1 - p)\left(1 + 2E[X] + E[X^2]\right) \;\Rightarrow\; E[X^2] = \frac{1 + 2(1 - p)E[X]}{p} = \frac{2}{p^2} - \frac{1}{p}$$
$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} = \frac{1 - p}{p^2}$$
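The results E[X] = 1/p and Var[X] = (1 − p)/p² can be checked by summing the series directly, assuming (as the results suggest) that X here is the geometric random variable with P(X = k) = p(1 − p)^(k−1), k = 1, 2, ... The truncation length and function name are arbitrary choices for this sketch.

```python
def geometric_moments(p, terms=10_000):
    """Mean and variance of P(X = k) = p (1 - p)**(k - 1), by truncated summation."""
    mean = 0.0
    second_moment = 0.0
    for k in range(1, terms + 1):
        pk = p * (1 - p) ** (k - 1)
        mean += k * pk
        second_moment += k * k * pk
    return mean, second_moment - mean ** 2

p = 0.25
mean, var = geometric_moments(p)
print(mean, 1 / p)              # both close to 4
print(var, (1 - p) / p ** 2)    # both close to 12
```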
Multiple random variables: Joint PMFS
Given two discrete random variables X and Y in the same experiment, the joint PMF of X and Y is $P_{X,Y}(X = x, Y = y)$ for all pairs of (x, y) that X and Y can take.
$$P_X(x) = \sum_y P_{X,Y}(x, y), \qquad P_Y(y) = \sum_x P_{X,Y}(x, y) \tag{A.25}$$
$P_X(x)$ and $P_Y(y)$ are called marginal PMFs. If Z = g(X, Y), then $P_Z(z) = \sum P_{X,Y}(x, y)$, where the summation is over all (x, y) with g(x, y) = z.
$$P_Z(z) = \sum_{\{(x,y) \,|\, g(x,y) = z\}} P_{X,Y}(x, y) \tag{A.26}$$
$$E[g(X, Y)] = \sum_{x,y} g(x, y)P_{X,Y}(x, y) \tag{A.27}$$
For example, E[aX + bY + c] = aE[X] + bE[Y] + c.
For three variables:
$$P_{X,Y}(x, y) = \sum_z P_{X,Y,Z}(x, y, z) \tag{A.28}$$
$$P_X(x) = \sum_y \sum_z P_{X,Y,Z}(x, y, z) \tag{A.29}$$
$$E[g(X, Y, Z)] = \sum_{x,y,z} g(x, y, z)P_{X,Y,Z}(x, y, z) \tag{A.30}$$
For n random variables $X_1, X_2, \ldots, X_n$ and scalars $a_1, a_2, \ldots, a_n$
$$E[a_1X_1 + a_2X_2 + \cdots + a_nX_n] = a_1E[X_1] + a_2E[X_2] + \cdots + a_nE[X_n]$$
[Margin note: We saw this before: a binomial is constructed from several Bernoulli variables.]
For example,
$$E[aX + bY + cZ + d] = aE[X] + bE[Y] + cE[Z] + d$$
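Marginal PMFs and expectations of functions of (X, Y) follow mechanically from a joint PMF stored as a table. The sketch below uses a small made-up experiment (two fair coin tosses, X = number of heads, Y = 1 if the first toss is a head); the helper names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def marginals(joint):
    """Sum the joint PMF over the other variable, as in (A.25)."""
    px, py = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)

def expectation(joint, g):
    """E[g(X, Y)] as in (A.27)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

quarter = Fraction(1, 4)
# Outcomes TT, TH, HT, HH mapped to (number of heads, first toss is a head)
joint = {(0, 0): quarter, (1, 0): quarter, (1, 1): quarter, (2, 1): quarter}

px, py = marginals(joint)
e_x = expectation(joint, lambda x, y: x)
e_y = expectation(joint, lambda x, y: y)
e_lin = expectation(joint, lambda x, y: 2 * x + 3 * y + 1)
print(px, py, e_lin, 2 * e_x + 3 * e_y + 1)
```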
Conditional PMFs
Given that event A has occurred, we can use conditional probability: this implies a new
sample space where A is known to have happened. The conditional PMF of X, given that
event A occurred with P(A) > 0 is
$$P_{X|A}(x) = P(X = x | A) = \frac{P(\{X = x\} \cap A)}{P(A)} \tag{A.31}$$
$$\sum_x P(\{X = x\} \cap A) = P(A) \tag{A.32}$$
Events {X = x} ∩ A are disjoint for different values of x and hence $\sum_x P_{X|A}(x) = 1$, so $P_{X|A}$ is a PMF.
Example A.5 X = roll of a die and A = event that roll is an even number
$$P_{X|A}(x) = P(X = x | \text{roll is even}) = \frac{P(X = x \text{ and } X \text{ is even})}{P(\text{roll is even})} = \begin{cases} 1/3 & \text{if } x = 2, 4, 6 \\ 0 & \text{otherwise} \end{cases}$$
Conditional PMF of X, given random variable Y = y, is used when one random variable is conditioned on another.
$$P_{X|Y}(x|y) = P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P_{X,Y}(x, y)}{P_Y(y)}$$
Note that $\sum_x P_{X|Y}(x|y) = 1$ for normalization. $P_{X,Y}(x, y) = P_Y(y)P_{X|Y}(x|y) = P_X(x)P_{Y|X}(y|x)$ ⇒ we can compute joint PMFs from conditionals using a sequential approach.
Example A.6 Consider four rolls of a six sided die. X = number of 1s and Y = number of 2s obtained. What is $P_{X,Y}(x, y)$?
Soln:
The marginal PMF $P_Y$ is a binomial.
$$P_Y(y) = \binom{4}{y}(1/6)^y(5/6)^{4-y}, \quad y = 0, 1, 2, 3, 4$$
To get the conditional PMF $P_{X|Y}(x|y)$: given that Y = y, X = number of 1s in the 4 − y rolls remaining, in which the values 1, 3, 4, 5, 6 occur with equal probability. Therefore
$$P_{X|Y}(x|y) = \binom{4-y}{x}(1/5)^x(4/5)^{4-y-x}$$
for all x, y = 0, 1, 2, 3, 4 with 0 ≤ x + y ≤ 4.
For x, y such that 0 ≤ x + y ≤ 4
$$P_{X,Y}(x, y) = P_Y(y)P_{X|Y}(x|y) = \binom{4}{y}(1/6)^y(5/6)^{4-y}\binom{4-y}{x}(1/5)^x(4/5)^{4-y-x}$$
For other (x, y), $P_{X,Y}(x, y) = 0$.
Marginal PMFs can therefore be computed from conditional PMFs:
$$P_X(x) = \sum_y P_{X,Y}(x, y) = \sum_y P_Y(y)P_{X|Y}(x|y)$$
[Margin note: This is the total probability theorem in a different notation.]
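Example A.6 can be verified with exact fractions: build the joint PMF sequentially from $P_Y$ and $P_{X|Y}$, then confirm that it sums to 1 and that the marginal of X is itself Binomial(4, 1/6), as symmetry between the faces 1 and 2 requires. The function names are illustrative.

```python
from fractions import Fraction
from math import comb

ROLLS = 4

def p_y(y):
    # number of 2s in four rolls: Binomial(4, 1/6)
    return comb(ROLLS, y) * Fraction(1, 6) ** y * Fraction(5, 6) ** (ROLLS - y)

def p_x_given_y(x, y):
    # number of 1s among the remaining 4 - y rolls, each a 1 with probability 1/5
    remaining = ROLLS - y
    if x > remaining:
        return Fraction(0)
    return comb(remaining, x) * Fraction(1, 5) ** x * Fraction(4, 5) ** (remaining - x)

joint = {(x, y): p_y(y) * p_x_given_y(x, y)
         for x in range(ROLLS + 1) for y in range(ROLLS + 1)}

# Marginal of X by the total probability theorem
p_x = {x: sum(joint[(x, y)] for y in range(ROLLS + 1)) for x in range(ROLLS + 1)}
print(sum(joint.values()), p_x)
```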
Independence
Random variable X is independent of event A if P(X = x and A) = P(X = x)P(A) = $P_X(x)P(A)$ for all x. As long as P(A) > 0, since $P_{X|A}(x)$ = P(X = x and A)/P(A), this implies that $P_{X|A}(x) = P_X(x)$ for all x.
Random variables X and Y are independent if
$$P_{X,Y}(x, y) = P_X(x)P_Y(y) \tag{A.33}$$
for all x, y ⇔ $P_{X|Y}(x|y) = P_X(x)$ for all y with $P_Y(y) > 0$, and all x.
[Margin note: The value of Y (= y) has no bearing on the value of X (= x).]
Random variables X and Y are conditionally independent given an event A with P(A) > 0 if
$$P(X = x, Y = y | A) = P(X = x | A)P(Y = y | A) \tag{A.34}$$
for all x and y ⇔ $P_{X|Y,A}(x|y) = P_{X|A}(x)$ for all x and y such that $P_{Y|A}(y) > 0$.
For independent variables, independence implies that E[XY] = E[X]E[Y]:
$$E[XY] = \sum_x \sum_y xy\,P_{X,Y}(x, y) = \sum_x \sum_y xy\,P_X(x)P_Y(y) = \sum_x xP_X(x) \sum_y yP_Y(y) = E[X]E[Y] \tag{A.35}$$
Similarly
$$E[g(X)h(Y)] = E[g(X)]E[h(Y)] \tag{A.36}$$
For two independent random variables X and Y, with Z = X + Y,
$$E[X + Y] = E[X] + E[Y] \tag{A.37}$$
$$\begin{aligned} \mathrm{Var}[Z] &= E\left[(X + Y - E[X + Y])^2\right] = E\left[(X + Y - E[X] - E[Y])^2\right] \\ &= E\left[((X - E[X]) + (Y - E[Y]))^2\right] \\ &= E\left[(X - E[X])^2\right] + E\left[(Y - E[Y])^2\right] + 2E\left[(X - E[X])(Y - E[Y])\right] \\ &= E\left[(X - E[X])^2\right] + E\left[(Y - E[Y])^2\right] = \mathrm{Var}[X] + \mathrm{Var}[Y] \end{aligned} \tag{A.38}$$
[Margin note: Expectations of combined variables always add up. Variances add up only if X and Y are independent.]
Independence for several random variables
Independence ⇒ $P_{X,Y,Z}(x, y, z) = P_X(x)P_Y(y)P_Z(z)$ for all x, y, z etc. Obviously g(X, Y) and h(Z) are independent. For n independent variables $X_1, X_2, \ldots, X_n$
$$\mathrm{Var}[X_1 + X_2 + \cdots + X_n] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + \cdots + \mathrm{Var}[X_n] \tag{A.39}$$
Example A.7 Variance of the binomial: Given n independent coin tosses, with p = probability of heads. For each i let $X_i$ = Bernoulli random variable (= 1 if heads, = 0 if tails), and $X = X_1 + X_2 + \cdots + X_n$. Then X is a binomial random variable. Coin tosses are independent ⇒ $X_1, X_2, \ldots, X_n$ are independent variables and
$$\mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = np(1 - p)$$
Covariance and Correlation
Covariance of X & Y (random vars) = cov[X, Y]
$$\mathrm{cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \tag{A.40}$$
When cov[X, Y] = 0, X & Y are uncorrelated. If X & Y are independent, E[XY] = E[X]E[Y] ⇒ cov[X, Y] = 0.
[Margin note: If X & Y are independent, they are also uncorrelated. The reverse is not true! Beware!]
Example A.8 Let (X, Y) take values (1, 0), (0, 1), (−1, 0), (0, −1) each with probability 1/4, i.e. the marginal probabilities of X & Y are symmetric around 0. So E[X] = E[Y] = 0. For all possible values of (x, y), either x or y = 0 ⇒ XY = 0 ⇒ E[XY] = 0 ⇒ cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] = 0 ⇒ X & Y are uncorrelated. However X & Y are not independent: nonzero X implies Y = 0 etc.
[Margin note: Uncorrelated variables need not be independent!]
Correlation Coefficient of X & Y
Let X & Y = random variables with nonzero variances. Correlation coefficient = normalized version of covariance =
$$\rho_{XY} = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} \tag{A.41}$$
[Margin note: Note that $-1 \le \rho_{XY} \le 1$.]
Example A.9 Consider n independent tosses of a biased coin (P(head) = p). Let X = # of heads & Y = # of tails. Are X, Y correlated?
Soln:
For all (x, y), x + y = n ⇒ E[X] + E[Y] = n ⇒ X − E[X] = −(Y − E[Y])
$$\mathrm{cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = -E\left[(X - E[X])^2\right] = -\mathrm{Var}[X]$$
Note that Var[Y] = Var[n − X] = Var[−X] = Var[X].
$$\rho(X, Y) = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{-\mathrm{Var}[X]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[X]}} = -1$$
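Example A.8 makes a good numerical check of "uncorrelated does not imply independent". The sketch below assumes the four points are (1, 0), (0, 1), (−1, 0) and (0, −1), each with probability 1/4, which is what the symmetry argument in the example requires.

```python
from fractions import Fraction

points = [(1, 0), (0, 1), (-1, 0), (0, -1)]
joint = {pt: Fraction(1, 4) for pt in points}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

# Independence would need P(X = 0, Y = 0) = P(X = 0) P(Y = 0)
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), Fraction(0))

print("cov =", cov)
print("P(X=0, Y=0) =", p_x0_y0, "but P(X=0) P(Y=0) =", p_x0 * p_y0)
```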
Correlation
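For the sum of several, not necessarily independent variables (general case):
$$\mathrm{Var}\left[\sum_{i=1}^{n} c_iX_i\right] = \sum_{i=1}^{n} c_i^2\,\mathrm{Var}[X_i] + 2\sum_{i=1}^{n}\sum_{j=1}^{n} c_ic_j\,\mathrm{cov}[X_i, X_j], \quad i < j \tag{A.42}$$
$$\begin{aligned} \mathrm{Var}[c_1X_1 + c_2X_2] &= c_1^2\,\mathrm{Var}[X_1] + c_2^2\,\mathrm{Var}[X_2] + 2c_1c_2\,\mathrm{cov}[X_1, X_2] \\ &= c_1^2\,\mathrm{Var}[X_1] + c_2^2\,\mathrm{Var}[X_2] + 2c_1c_2\sqrt{\mathrm{Var}[X_1]\,\mathrm{Var}[X_2]}\;\rho(X_1, X_2) \end{aligned}$$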
A.3 Continuous probability distributions
Continuous vs Discrete: Continuous distribs are more fine grained, may be more accurate and permit analysis by calculus. A random variable X is continuous if its probability distribution can be defined by a non negative function $f_X$ of X such that
$$P(X \in B) = \int_B f_X(x)\,dx \tag{A.43}$$
i.e. P(X is between a and b) = $\int_a^b f_X(x)\,dx$ = area under the curve from a to b. Total area under the curve = 1. $f_X(x)$ is a probability density function (PDF). Intuitively, PDF of X is large for high probability and low for low probability. For a single value of X, $P(X = a) = \int_a^a f_X(x)\,dx = 0$. So including endpoints does not matter:
$$P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b)$$
For $f_X$ to be a valid PDF, it must be
1. Nonnegative for all x: $f_X(x) \ge 0$ for all x
2. Normalized: $\int_{-\infty}^{\infty} f_X(x)\,dx = P(-\infty < X < \infty) = 1$
[Margin note: $f_X(x)$ is not the probability of an event: it can even be greater than one.]
$P([x, x + \delta x]) = \int_x^{x + \delta x} f_X(t)\,dt \approx f_X(x)\,\delta x$, where $f_X(x)$ = probability mass per unit length near x.
Expectation and Moments for continuous RVs
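$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx \tag{A.44}$$
The nth moment is $E[X^n]$. The variance is
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2 = \int_{-\infty}^{\infty} (x - E[X])^2 f_X(x)\,dx > 0$$
[Margin note: $E[g(X)] = \int_{-\infty}^{\infty} g(x)f_X(x)\,dx$; g(x) may be continuous or discrete.]
If Y = aX + b, E[Y] = aE[X] + b and Var[Y] = a²Var[X].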
[Margin note: Discrete Uniform law: Sample space had finite number of equally likely outcomes. For discrete variables, we count the number of outcomes concerned with an event. For continuous variables we compute the length of a subset of the real line.]
Continuous uniform random variable
The PDF of a continuous uniform random variable is
$$f_X(x) = \begin{cases} c, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \tag{A.45}$$
where c must be > 0. For f to be a PDF, $1 = \int_a^b c\,dz = c\int_a^b dz = c(b - a)$ ⇒ c = 1/(b − a).
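Piecewise constant PDF, general form:
$$f_X(x) = \begin{cases} c_i & \text{if } a_i \le x < a_{i+1}, \; i = 1, 2, \ldots, n - 1 \\ 0 & \text{otherwise} \end{cases}$$
$$1 = \int_{a_1}^{a_n} f_X(x)\,dx = \sum_{i=1}^{n-1} \int_{a_i}^{a_{i+1}} c_i\,dx = \sum_{i=1}^{n-1} c_i(a_{i+1} - a_i)$$
Another example:
$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{x}} & \text{if } 0 < x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^1 \frac{1}{2\sqrt{x}}\,dx = \sqrt{x}\,\Big|_0^1 = 1$$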
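Mean and variance of the uniform random variable:
[Fig. A.11: The piecewise constant distribution. Axes: x vs. $f_X$; levels C1, C2, C3.]
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_a^b x\,\frac{1}{b - a}\,dx = \frac{1}{b - a}\cdot\frac{1}{2}x^2\,\Big|_a^b = \frac{1}{b - a}\cdot\frac{b^2 - a^2}{2} = \frac{a + b}{2}$$
(The PDF is symmetric around (a + b)/2.)
$$E[X^2] = \int_a^b x^2\,\frac{1}{b - a}\,dx = \frac{1}{b - a}\cdot\frac{x^3}{3}\,\Big|_a^b = \frac{b^3 - a^3}{3(b - a)} = \frac{a^2 + ab + b^2}{3}$$
$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{a^2 + ab + b^2}{3} - \left(\frac{a + b}{2}\right)^2 = \frac{(b - a)^2}{12}$$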
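Example A.10 Let [a, b] = [0, 1] and let
$$g(x) = \begin{cases} 1 & \text{if } x \le 1/3 \\ 2 & \text{if } x > 1/3 \end{cases}$$
1. Discrete: Y = g(X) has PMF
$P_Y(1) = P(X \le 1/3) = 1/3$, $P_Y(2) = P(X > 1/3) = 2/3$
$$E[Y] = \frac{1}{3}\cdot 1 + \frac{2}{3}\cdot 2 = \frac{5}{3}$$
2. Continuous:
$$E[Y] = \int_0^1 g(x)f_X(x)\,dx = \int_0^{1/3} 1\,dx + \int_{1/3}^{1} 2\,dx = \frac{5}{3}$$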
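Both routes of Example A.10 can be reproduced exactly. The continuous route below replaces the integral with a midpoint Riemann sum over [0, 1]; the grid size of 3000 is a choice made here so that the jump of g at 1/3 falls on a cell boundary and the sum is exact.

```python
from fractions import Fraction

def g(x):
    return 1 if x <= Fraction(1, 3) else 2

# Route 1: PMF of Y = g(X) for X uniform on [0, 1]
p_y = {1: Fraction(1, 3), 2: Fraction(2, 3)}
e_discrete = sum(y * p for y, p in p_y.items())

# Route 2: integrate g(x) f_X(x) with f_X(x) = 1 on [0, 1], by the midpoint rule
def e_continuous(steps=3000):
    h = Fraction(1, steps)
    return sum(g((i + Fraction(1, 2)) * h) * h for i in range(steps))

print(e_discrete, e_continuous())
```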
Exponential continuous distribution
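PDF of X is
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{A.46}$$
Is it a PDF?
$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\,\Big|_0^{\infty} = 1$$
For any a ≥ 0
$$P(X \ge a) = \int_a^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\,\Big|_a^{\infty} = e^{-\lambda a}$$
[Fig. A.12: The exponential distribution with λ = 0.05. Axes: x vs. $f_X$.]
Examples: Time till a light bulb burns out. Time till an accident.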
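Mean
$$E[X] = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx = \left(-x e^{-\lambda x}\right)\Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 - \frac{e^{-\lambda x}}{\lambda}\,\Big|_0^{\infty} = \frac{1}{\lambda}$$
Variance
$$E[X^2] = \int_0^{\infty} x^2\,\lambda e^{-\lambda x}\,dx = \left(-x^2 e^{-\lambda x}\right)\Big|_0^{\infty} + \int_0^{\infty} 2x e^{-\lambda x}\,dx = 0 + \frac{2}{\lambda}E[X] = \frac{2}{\lambda^2}$$
$$\mathrm{Var}[X] = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$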
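Example A.11 The sky is falling... Meteorites land in a certain area on average once every 10 days. What is the probability that the first meteorite lands between 6 AM and 6 PM of the first day, given that it is now midnight?
Soln:
Let X = elapsed time till strike (in days) = exponential variable with mean = 1/λ = 10 days ⇒ λ = 1/10.
$$P\left(\tfrac{1}{4} \le X \le \tfrac{3}{4}\right) = P\left(X \ge \tfrac{1}{4}\right) - P\left(X \ge \tfrac{3}{4}\right) = e^{-\lambda(1/4)} - e^{-\lambda(3/4)} = e^{-1/40} - e^{-3/40} \approx 0.0476$$
Probability that the meteorite lands between 6 AM and 6 PM of some day: for the kth day, this time frame is (k − 3/4) ≤ X ≤ (k − 1/4), so the probability is
$$\sum_{k=1}^{\infty} P\left(k - \tfrac{3}{4} \le X \le k - \tfrac{1}{4}\right) = \sum_{k=1}^{\infty} P\left(X \ge k - \tfrac{3}{4}\right) - \sum_{k=1}^{\infty} P\left(X \ge k - \tfrac{1}{4}\right) = \sum_{k=1}^{\infty}\left(e^{-(4k-3)/40} - e^{-(4k-1)/40}\right)$$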
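Both probabilities in Example A.11 are easy to evaluate numerically. The closed form used for the infinite sum is not in the notes: it comes from treating the sum as a geometric series with ratio $e^{-\lambda}$, and is included here only as a cross-check on the truncated sum.

```python
import math

lam = 1 / 10   # rate per day, from mean = 1/lam = 10 days

def survival(a):
    # P(X >= a) for the exponential distribution
    return math.exp(-lam * a)

# 6 AM to 6 PM on the first day: between 1/4 and 3/4 of a day from midnight
first_day = survival(0.25) - survival(0.75)

# 6 AM to 6 PM on some day: sum over days k = 1, 2, ... (truncated; the tail is negligible)
some_day = sum(survival(k - 0.75) - survival(k - 0.25) for k in range(1, 2001))

# Cross-check: geometric series with ratio exp(-lam)
closed_form = first_day / (1 - math.exp(-lam))

print(round(first_day, 4), round(some_day, 4), round(closed_form, 4))
```

The second probability comes out just under one half, which is plausible: the daytime window covers half of each day, and the exponential density decays only slowly over a single day when the mean is 10 days.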
Cumulative distribution function (CDF)
CDF of X
$$F_X(x) = P(X \le x) = \begin{cases} \sum_{k \le x} P_X(k) & X \text{: discrete} \\ \int_{-\infty}^{x} f_X(t)\,dt & X \text{: continuous} \end{cases} \tag{A.47}$$
For the discrete form the CDF will look like a staircase.
Some properties of CDFs $F_X(x)$
$F_X(x)$ is monotonically nondecreasing: if x ≤ y, then $F_X(x) \le F_X(y)$.
$F_X(x) \to 0$ as $x \to -\infty$, and $F_X(x) \to 1$ as $x \to +\infty$.
If X is discrete, $F_X$ is piecewise constant (staircase).
If X is continuous, $F_X$ is continuous.
If X is discrete:
CDF from PMF:
$$F_X(k) = \sum_{i=-\infty}^{k} P_X(i)$$
PMF from CDF: (for all k)
$$P_X(k) = P(X \le k) - P(X \le k - 1) = F_X(k) - F_X(k - 1)$$
If X is continuous:
CDF from PDF:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$
PDF from CDF:
$$f_X(x) = \frac{dF_X}{dx}(x)$$
Geometric vs. Exponential CDFs
Let X = geometric random var. with parameter p. For example, X = number of trials up to and including the first success, with probability of success per trial = p.
$$P(X = k) = p(1 - p)^{k-1}, \quad k = 1, 2, \ldots$$
$$F^{geo}_X(n) = \sum_{k=1}^{n} p(1 - p)^{k-1} = p\,\frac{1 - (1 - p)^n}{1 - (1 - p)} = 1 - (1 - p)^n, \quad n = 1, 2, \ldots$$
Let Y = exponential random var, λ > 0
$$F^{exp}_Y(x) = \int_0^x \lambda e^{-\lambda t}\,dt = -e^{-\lambda t}\,\Big|_0^x = 1 - e^{-\lambda x}, \quad x > 0$$
[Fig. A.13: Geometric vs. exponential distributions (p = 0.15 and λ = 0.185). Axes: x vs. $f_X$.]
Comparison: Let λ = −ln(1 − p) ⇒ $e^{-\lambda} = 1 - p$ ⇒ exponential and geometric CDFs match for all x = n, n = 1, 2, ...
$$F^{exp}(n) = F^{geo}(n), \quad n = 1, 2, \ldots$$
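The matching of the two CDFs at integer points can be confirmed directly once λ is set to −ln(1 − p). The value p = 0.15 below is just a sample choice for this sketch.

```python
import math

def geometric_cdf(n, p):
    return 1 - (1 - p) ** n

def exponential_cdf(x, lam):
    return 1 - math.exp(-lam * x)

p = 0.15
lam = -math.log(1 - p)          # chosen so that exp(-lam) = 1 - p

pairs = [(geometric_cdf(n, p), exponential_cdf(n, lam)) for n in range(1, 11)]
for n, (f_geo, f_exp) in enumerate(pairs, start=1):
    print(n, round(f_geo, 6), round(f_exp, 6))
```

Between integers the exponential CDF keeps rising smoothly while the geometric CDF stays flat, so the two agree only at x = n.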
Normal (Gaussian) random variables
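Most useful distribution: use it to approximate other distributions
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \mu)^2/2\sigma^2}, \quad -\infty < x < \infty \tag{A.48}$$
for some parameters μ & σ (σ > 0).
Normalization confirms that it is a PDF:
$$\int_{-\infty}^{\infty} f_X(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(x - \mu)^2/2\sigma^2}\,dx = 1$$
[Fig. A.14: Normal distribution. Axes: x vs. density $f_X$; a and b marked on the x axis.]
E[X] = mean = μ (from symmetry arguments alone).
$$\mathrm{Var}[X] = \int_{-\infty}^{\infty} (x - E[X])^2 f_X(x)\,dx = \int_{-\infty}^{\infty} (x - \mu)^2\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \mu)^2/2\sigma^2}\,dx$$
Let z = (x − μ)/σ
$$\begin{aligned} \mathrm{Var}[X] &= \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz = \frac{\sigma^2}{\sqrt{2\pi}}\left(-z e^{-z^2/2}\right)\Big|_{-\infty}^{\infty} + \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz \\ &= \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \sigma^2 \end{aligned}$$
X = normal var with μ and σ² = (mean, var). Height of N(μ, σ²) = $1/(\sigma\sqrt{2\pi})$ and hence height ∝ 1/σ. If Y = aX + b then E[Y] = aE[X] + b and Var[Y] = a²σ², and Y is also normal.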
Standard Normal Distribution
A normal distribution of Y with mean μ = 0 and σ² = 1; N(μ, σ²) = N(0, 1) = standard normal
$$\text{CDF of } Y = \Phi(y) = P(Y \le y) = P(Y < y) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-t^2/2}\,dt \tag{A.49}$$
Φ(y) is available as a table. If the table only gives the value of Φ(y) for y ≥ 0, use symmetry.
Example A.12 Φ(−0.5) = P(Y ≤ −0.5) = P(Y ≥ 0.5) = 1 − P(Y < 0.5) = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085
Given a normal random var X with mean μ and variance σ², use
$$Z = \frac{X - \mu}{\sigma}$$
Then
$$E[Z] = \frac{E[X] - \mu}{\sigma} = 0, \qquad \mathrm{Var}[Z] = \frac{\mathrm{Var}[X]}{\sigma^2} = 1$$
⇒ Z = standard normal.
Example A.13 Average rain at a spot = normal distribution, mean = μ = 60 mm/yr and σ = 20 mm/yr. What is the probability that we get at least 80 mm/yr?
Soln: Let X = rainfall/yr. Then Z = (X − μ)/σ = (X − 60)/20. So
$$P(X \ge 80) = P\left(Z \ge \frac{80 - 60}{20}\right) = P(Z \ge 1) = 1 - \Phi(1) = 1 - 0.8413 = 0.1587$$
In general
$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$
where Z is standard normal. The 100u-th percentile of N(0, 1) = $Z_u$, and $P(X < Z_u) = u$ where X ∼ N(0, 1).
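In place of a printed table, Φ can be evaluated through the error function, using the identity Φ(y) = ½[1 + erf(y/√2)]. The sketch below reworks Examples A.12 and A.13 that way; the function names are illustrative.

```python
import math

def phi(y):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    # standardize first: P(X <= x) = Phi((x - mu) / sigma)
    return phi((x - mu) / sigma)

example_a12 = phi(-0.5)                       # uses symmetry implicitly
example_a13 = 1.0 - normal_cdf(80, 60, 20)    # P(rainfall >= 80 mm/yr)

print(round(phi(0.5), 4), round(example_a12, 4), round(example_a13, 4))
```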
Approximating a binomial with a normal
Binomial: Probability of k successes in n trials, each trial has success probability of p. The binomial PMF is painful to handle for large n, so we approximate. E[X] = np and Var[X] = npq (with q = 1 − p) ⇒ the equivalent Normal distribution is N(np, npq).
Consider a binomial with mean = np and variance = npq, with a range of interest from a ≤ X ≤ b. It may be approximated with a normal, N(np, npq), from a − 1/2 ≤ X ≤ b + 1/2. For the special case a = b, P(X = a) = area under the normal curve from a − 1/2 to a + 1/2.
At the extremes:
Binomial     Normal
P(X = 0)     area to the left of 1/2
P(X = n)     area to the right of n − 1/2
[Margin note: Use when npq ≥ 5, with n, p, q all moderate.]
Approximating the Poisson with a normal
Poisson is difficult to use for large λ. $P_X(X = x) \approx N(\lambda, \lambda)$ where for the Poisson, mean = var = λ. For the Normal, we take the area from x − 1/2 to x + 1/2, or the area to the left of 1/2 for x = 0. Use this approximation for λ ≥ 10.
Summary of approximations
1. Approximating Binomial with Normal: Rule of thumb: npq ≥ 5, with n, p, q all moderately valued.
2. Approximating Poisson with Normal: λ ≥ 10.
3. Approximating Binomial with Poisson: n ≥ 100, p ≤ 0.01. (Obviously, both Poisson and Normal may be used for n ≥ 1000, p = 0.01.)
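To see how well the continuity-corrected normal approximation does, the exact binomial probability of a range a ≤ X ≤ b can be compared with the normal area from a − 1/2 to b + 1/2. The case n = 40, p = 0.5 (so npq = 10) and the range 18 to 22 are sample values chosen for this sketch.

```python
import math
from math import comb

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_range(n, p, a, b):
    # exact P(a <= X <= b)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def normal_range(n, p, a, b):
    # area under N(np, npq) from a - 1/2 to b + 1/2
    mu = n * p
    sd = math.sqrt(n * p * (1 - p))
    return phi((b + 0.5 - mu) / sd) - phi((a - 0.5 - mu) / sd)

n, p = 40, 0.5
exact = binomial_range(n, p, 18, 22)
approx = normal_range(n, p, 18, 22)
print(round(exact, 4), round(approx, 4))
```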
Beta distribution
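The Beta distribution has as its pdf
$$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,x^{a-1}(1 - x)^{b-1}, \quad 0 \le x \le 1 \tag{A.50}$$
[Margin note: Unlike the binomial, a and b need not be integers. When a = b = 1, Beta(x; 1, 1) = Uniform distribution.]
Γ(a + 1) = aΓ(a) in general, and when a is an integer, using this relationship recursively gives us Γ(a + 1) = a!. It is important to appreciate that by varying a and b, the pdf could attain various shapes. This is particularly useful when we look for a pdf of an appropriate shape.
$$\begin{aligned} E[x] &= \int_0^1 x f(x)\,dx = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\int_0^1 x^a(1 - x)^{b-1}\,dx \\ &= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a + 1)\Gamma(b)}{\Gamma(a + b + 1)}\int_0^1 \frac{\Gamma(a + b + 1)}{\Gamma(a + 1)\Gamma(b)}\,x^a(1 - x)^{b-1}\,dx \\ &= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a + 1)\Gamma(b)}{\Gamma(a + b + 1)} = \frac{a}{a + b} \end{aligned}$$
Similarly
$$E[x^2] = \frac{a(a + 1)}{(a + b + 1)(a + b)}$$
$$\mathrm{Var}[x] = \frac{ab}{(a + b)^2(a + b + 1)}$$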
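The closed forms for the Beta mean and variance can be checked by numerical integration of the pdf. The sketch below uses a plain midpoint rule and the sample parameters a = 2, b = 3 (both chosen here; values of a or b below 1 would make the pdf unbounded at an endpoint and need more care).

```python
import math

def beta_pdf(x, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def integrate_01(f, steps=20_000):
    # midpoint rule on [0, 1]
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h

a, b = 2.0, 3.0
total = integrate_01(lambda x: beta_pdf(x, a, b))
mean = integrate_01(lambda x: x * beta_pdf(x, a, b))
second = integrate_01(lambda x: x * x * beta_pdf(x, a, b))
var = second - mean ** 2

print(total, mean, a / (a + b))
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))
```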
Appendix B Estimation and Hypothesis Testing
B.1 Samples and statistics
There's no sense in being precise when you don't even know what you're talking about. (John von Neumann)
We sample to draw some conclusions about an entire population.
Examples: Exit polls, census.
We assume that the population can be described by a probability distribution.
We assume that the samples give us measurements following this distribution.
We assume that the samples are randomly chosen. It is very difficult to verify that a sample is really a random sample.
The population is assumed to be (effectively) infinite. Sampling theory must be used if the population is actually finite.
Simple random sample: each member of a random sample is chosen independently and has the same (nonzero) probability of being chosen.
Cluster sampling: If the entire population cannot be uniformly sampled, divide it into clusters and sample some clusters.
Clinical Trials and Randomization
Block randomization: one block gets the treatment and another the placebo.
Blind trials: Patients do not know what they get.
Double blind trials: both doctors and patients do not know what they get.
Distributions and Estimation
We wish to infer the parameters of a population distribution: mean, variance, proportion etc.
We use a statistic derived from the sample data: sample mean, sample variance etc.
A statistic is a random variable. What distribution does it follow? What are its parameters?
Notation
Population parameters are typically in Greek.
μ = population mean.
σ² = population variance.
The corresponding statistic is not in Greek!
$\bar{x}$ = sample mean.
s² = sample variance.
We can also denote an estimate of the parameter θ as $\hat{\theta}$.
Regression:
You want: y = αx + β.
You get: y = ax + b.
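Why do we infer using the sample mean?
Consider a population with mean $\mu$ and variance $\sigma^2$. We have $n$ measurements, $x_1, x_2, \ldots, x_n$.
$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} \tag{B.1}$$
It's all about expectations:
$$E[x_i] = \mu \qquad \mathrm{Var}[x_i] = \sigma^2 \tag{B.2}$$
$$E[\bar{x}] = E\left[\frac{x_1 + \cdots + x_n}{n}\right] = \frac{E[x_1] + \cdots + E[x_n]}{n} = \frac{n\mu}{n} = \mu \tag{B.3}$$
$$E[x_i] = E[\bar{x}] = \mu \tag{B.4}$$
$\bar{x}$ is an unbiased estimator of $\mu$. For a statistic $\hat{\theta}$ to be an unbiased estimator of $\theta$,
$$E[\hat{\theta}] = \theta \tag{B.5}$$
For a symmetric population distribution, the population mean has other unbiased estimators such as the sample median, (min + max)/2 and the trimmed mean. $\bar{x}$ is not a robust estimator as it is sensitive to outliers. Nevertheless $\bar{x}$ is the preferred estimator of $\mu$: for a normal population, $\bar{x}$ is the unbiased estimator with the smallest variance.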
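What is the variance of $\bar{x}$?
We had $\mathrm{Var}[x_i] = \sigma^2$.
$$\mathrm{Var}[\bar{x}] = \mathrm{Var}\left[\frac{x_1 + \cdots + x_n}{n}\right] = \frac{1}{n^2}\mathrm{Var}[x_1 + \cdots + x_n] = \frac{1}{n^2}\left(\mathrm{Var}[x_1] + \cdots + \mathrm{Var}[x_n]\right) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \tag{B.6}$$
The third equality uses independence!
$x_i$ or $\bar{x}$?
$$E[x_i] = \mu \qquad \mathrm{Var}[x_i] = \sigma^2 \tag{B.7}$$
$$E[\bar{x}] = \mu \qquad \mathrm{Var}[\bar{x}] = \frac{\sigma^2}{n} \tag{B.8}$$
If $x \sim N(0, 1)$, then $\bar{x} \sim N\left(0, \frac{1}{n}\right)$.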
The variance of the sample mean depends on $n$. Therefore we must sample more and gain accuracy! Intuitively, we desire replication of results.
Fig. B.1: The distribution of $\bar{x}$
The population size can matter
For an infinite population, $\mathrm{Var}[\bar{x}] = \sigma^2/n$. For a finite population of size $N$, we need a correction factor (more on this in a separate handout):
$$\mathrm{Var}[\bar{x}] = \frac{\sigma^2}{n}\,\frac{N - n}{N - 1} \tag{B.9}$$
The Central Limit Theorem
Theorem B.1 The sum of a large number of independent random variables has a distribution that is approximately normal.
Let $x_1, \ldots, x_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. For large $n$
$$(x_1 + \cdots + x_n) \sim N(n\mu, n\sigma^2) \tag{B.10}$$
$$P\left(\frac{(x_1 + \cdots + x_n) - n\mu}{\sigma\sqrt{n}} < x\right) \approx P(z < x) \tag{B.11}$$
Consider the total obtained on rolling several ($n$) dice (Fig. B.2). As $n$ increases, the distribution approaches a Gaussian curve!
Fig. B.2: Rolling n dice.
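The dice experiment can be tried numerically. The following is a minimal sketch, assuming NumPy is available; the function name `dice_totals`, the number of trials and the seed are illustrative choices, not part of the notes. It standardizes the total of $n$ fair dice with $n\mu$ and $n\sigma^2$ from Eq. B.10 (for one die, $\mu = 3.5$ and $\sigma^2 = 35/12$) and reports how often $|z| < 1.96$.

```python
import numpy as np

def dice_totals(n_dice, n_trials=100_000, seed=0):
    """Simulate the total shown by n_dice fair dice, repeated n_trials times."""
    rng = np.random.default_rng(seed)
    rolls = rng.integers(1, 7, size=(n_trials, n_dice))  # faces 1..6
    return rolls.sum(axis=1)

# Mean and variance of a single fair die.
MU, VAR = 3.5, 35.0 / 12.0

for n in (1, 2, 10, 30):
    totals = dice_totals(n)
    # Standardize the total using Eq. B.10: mean n*mu, variance n*sigma^2.
    z = (totals - n * MU) / np.sqrt(n * VAR)
    inside = np.mean(np.abs(z) < 1.96)
    print(f"n={n:2d}  mean={totals.mean():8.3f}  var={totals.var():8.3f}  "
          f"P(|z|<1.96)={inside:.3f}")
```

Because the totals are discrete, the fraction inside $\pm 1.96$ only approximates 0.95, but it should move towards that value as $n$ grows.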
CLT and the Binomial
Let $X = x_1 + \cdots + x_n$, where $x_i$ is a Bernoulli variable.
$$E[x_i] = p \qquad \mathrm{Var}[x_i] = p(1 - p) \tag{B.12}$$
$$E[X] = np \qquad \mathrm{Var}[X] = np(1 - p) \tag{B.13}$$
CLT suggests that for large $n$
$$\frac{X - np}{\sqrt{np(1 - p)}} \sim N(0, 1) \tag{B.14}$$
The Binomial can be approximated with a normal if $np(1 - p) \ge 10$. For $p = 0.5$, $npq \ge 10$ implies $n \ge 40$! In general, the normal approximation may be applied if $n \ge 30$. So keep sampling!
Remember: even if an individual measurement does not follow a Normal distribution, the measured sample mean (for large $n$) does.
Sample variance
The population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \tag{B.15}$$
The sample variance is
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \tag{B.16}$$
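Why $n - 1$?
We had $E[\bar{x}] = \mu$. But is $E[s^2] = \sigma^2$? We compute $E[s^2]$ as follows:
$$(x_i - \bar{x})^2 = (x_i - \mu + \mu - \bar{x})^2 = (x_i - \mu)^2 + (\mu - \bar{x})^2 + 2(x_i - \mu)(\mu - \bar{x})$$
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 + \sum_{i=1}^{n}(\mu - \bar{x})^2 + 2\sum_{i=1}^{n}(x_i - \mu)(\mu - \bar{x}) \tag{B.17}$$
The middle term is
$$\sum_{i=1}^{n}(\mu - \bar{x})^2 = n(\mu - \bar{x})^2$$
The last term is
$$2\sum_{i=1}^{n}(x_i - \mu)(\mu - \bar{x}) = -2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) = -2n(\bar{x} - \mu)^2$$
Therefore
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 + n(\mu - \bar{x})^2 - 2n(\bar{x} - \mu)^2 = \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$$
$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} - (\bar{x} - \mu)^2$$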
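Taking the expectation on each side
$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = E\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}\right] - E\left[(\bar{x} - \mu)^2\right] \tag{B.18}$$
But,
$$E\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}\right] = \sigma^2 \tag{B.19}$$
$$E\left[(\bar{x} - \mu)^2\right] = \mathrm{Var}[\bar{x}] = \frac{\sigma^2}{n} \tag{B.20}$$
This implies
$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{(n - 1)\sigma^2}{n} \tag{B.21}$$
We have a biased estimator! This means that
$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}\right] = \sigma^2 \tag{B.22}$$
Therefore we define an unbiased estimator $s^2$ as
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \tag{B.23}$$
We lost a degree of freedom. Where? Compare
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$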
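The bias in Eq. B.21 is easy to see in a simulation. This is a small sketch assuming NumPy; the population ($\mu = 10$, $\sigma^2 = 4$), the sample size $n = 5$ and the number of replicates are arbitrary illustrative values. It averages the divide-by-$n$ and divide-by-$(n-1)$ estimators over many samples.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, n_rep = 5, 4.0, 200_000   # illustrative values, not from the notes

# n_rep independent samples of size n from N(10, sigma2).
samples = rng.normal(loc=10.0, scale=np.sqrt(sigma2), size=(n_rep, n))

biased = samples.var(axis=1, ddof=0)     # divides by n
unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

mean_biased = biased.mean()
mean_unbiased = unbiased.mean()

print("divide by n    :", mean_biased, " expected", (n - 1) / n * sigma2)
print("divide by n - 1:", mean_unbiased, " expected", sigma2)
```

The first average should sit near $(n-1)\sigma^2/n$ and the second near $\sigma^2$, in line with Eqs. B.21 and B.22.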
B.2 Point and Interval Estimation
Point estimates
Given $x_i \sim N(\mu, \sigma^2)$, the preferred point estimate of $\mu$ is $\bar{x}$. The preferred point estimate of $\sigma^2$ is $s^2$. We know how $\bar{x}$ behaves:
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \tag{B.24}$$
We do not know how $s^2$ behaves yet. Similarly, intuition leads us to the best estimate of the binomial proportion $p$ as $\hat{p} = X/n$. The maximum likelihood estimation approach below provides a method for the determination of parameter point estimates.
Optimal estimation approaches
The maximum likelihood estimate (MLE) of a parameter $\theta$ is that value of $\theta$ that maximizes the likelihood of seeing the observed values, i.e. $P(D|\theta)$. Note that instead we could desire to maximize $P(\theta|D)$: i.e. we choose the most probable parameter value given the data that we have already seen. This is Maximum a posteriori (MAP) estimation. The MAP estimate is typically more difficult to obtain because we need to rewrite it in terms of $P(D|\theta)$ using Bayes' law, and $P(\theta)$ needs to be specified. Also note that if $P(\theta)$ is effectively a constant (i.e. $\theta$ is a uniform RV), then MLE = MAP. There are other approaches towards obtaining estimates of parameters, including the confusingly named Bayesian estimation (different from MAP), maximum entropy etc.
In general note that we want estimates with small variances, and hence minimum variance estimates are popular. In the discussion that follows, we almost always use MLE; we will however encounter MAP at least once later.
MLE of the Binomial parameter p:
Let $X$ = random variable describing a coin flip, $X = 1$ if heads, $X = 0$ if tails. As before, we use the Bernoulli and the Binomial. For one coin flip,
$$f(x) = p^x (1 - p)^{1 - x} \tag{B.25}$$
For $n$ coin flips,
$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} \tag{B.26}$$
The log-likelihood may be written as
$$l(p|x_1, \ldots, x_n) = \sum_{i=1}^{n}\left[x_i \ln p + (1 - x_i)\ln(1 - p)\right] \tag{B.27}$$
The MLE of $p$ is $\hat{p}$ and is obtained from
$$\frac{d\,l(p)}{dp} = 0 \;\Rightarrow\; \frac{\sum x_i}{p} - \frac{\sum (1 - x_i)}{1 - p} = 0$$
which can be rearranged to give
$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{B.28}$$
The MLE of the binomial proportion ($X$ heads in $n$ tosses) is
$$\hat{p} = \frac{X}{n} \quad\text{and}\quad E[\hat{p}] = p \tag{B.29}$$
The answer here is obvious. MLE does this quite often: it gives us intuitive answers. However as the next example shows, MLE could be biased.
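As a check on Eq. B.28, the log-likelihood of Eq. B.27 can also be maximized numerically. The sketch below assumes NumPy and SciPy; the ten coin flips are made-up data and `neg_log_likelihood` is an illustrative helper name.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(p, x):
    """Negative of Eq. B.27 for 0/1 data x and success probability p."""
    x = np.asarray(x)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Hypothetical data: 7 heads in 10 flips.
flips = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

# Minimize the negative log-likelihood over p in (0, 1).
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                      args=(flips,), method="bounded")

p_hat_numeric = res.x
p_hat_closed = flips.mean()   # Eq. B.28: sum(x_i) / n

print(p_hat_numeric, p_hat_closed)
```

The numerical maximizer should agree with the closed form $\sum x_i / n$ to within the optimizer's tolerance.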
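An MLE perspective on $\bar{x}$ and $s^2$, for a Gaussian
Let $D = \{x_1, \ldots, x_n\}$ be a dataset of i.i.d. samples from $N(\mu, \sigma^2)$. Then the joint density function is
$$f(D) = \prod_{i=1}^{n}\frac{e^{-(x_i - \mu)^2/2\sigma^2}}{\sigma\sqrt{2\pi}} = \frac{1}{\sigma^n (2\pi)^{n/2}}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2\right] \tag{B.30}$$
$f(D)$ is the likelihood of seeing the data. $l = \ln[f(D)]$ is the log likelihood of seeing the data. Which estimates of $\mu$ and $\sigma^2$ would maximize the likelihood of seeing the data? Set
$$\frac{\partial l}{\partial \mu} = 0 \qquad \frac{\partial l}{\partial \sigma^2} = 0 \tag{B.31}$$
and solve for $\mu$ and $\sigma^2$.
$$\frac{\partial l}{\partial \mu} = 0 \;\Rightarrow\; \frac{\sum_{i=1}^{n}(x_i - \mu)}{\sigma^2} = 0 \;\Rightarrow\; \hat{\mu} = \frac{\sum_i x_i}{n} = \bar{x} \tag{B.32}$$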
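The MLE of $\mu$ is $\hat{\mu} = \bar{x}$. The MLE of $\sigma^2$ however is a surprise:
$$\frac{\partial l}{\partial \sigma^2} = 0 = -\frac{n}{2\sigma^2} + \frac{\partial}{\partial \sigma^2}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$$
The second term is
$$\frac{\partial}{\partial \sigma^2}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right] = \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2$$
$$= \frac{1}{2\sigma^4}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
$$= \frac{1}{2\sigma^4}\left[\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 + 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x})\right]$$
$$= \frac{1}{2\sigma^4}\left[nS^2 + n(\bar{x} - \mu)^2\right] \quad\text{where } S^2 \text{ is defined as } \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
(the cross term vanishes because $\sum_i (x_i - \bar{x}) = 0$). Hence
$$0 = -\frac{n}{2\sigma^2} + \frac{n}{2\sigma^4}\left[S^2 + (\bar{x} - \mu)^2\right]$$
But $\bar{x} = \hat{\mu}$ from before, and hence
$$\hat{\sigma}^2_{ML} = S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \tag{B.33}$$
The MLE of $\sigma^2$ is biased!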
Interval Estimation
We have point estimates of $\mu$ and $\sigma^2$. But remember that $\bar{x}$ and $s^2$ are random variables. We could use bounds: for example, $\mu = \bar{x} \pm \delta$. $\delta$ should depend on $\sigma^2$.
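Chebyshev's inequality for a random variable x
Theorem B.2
$$P(|x_i - \mu| > k\sigma) \le \frac{1}{k^2} \tag{B.34}$$
Split the outcomes into three regions: $R_1$ with $x_i < \mu - k\sigma$, $R_2$ with $|x_i - \mu| \le k\sigma$, and $R_3$ with $x_i > \mu + k\sigma$. Then
$$\sigma^2 = E\left[(x - \mu)^2\right] = \sum_{R_1, R_2, R_3}(x_i - E[x])^2 p(x_i)$$
$$\ge \sum_{R_1}(x_i - E[x])^2 p(x_i) + \sum_{R_3}(x_i - E[x])^2 p(x_i)$$
$$\ge \sum_{R_1} k^2\sigma^2 p(x_i) + \sum_{R_3} k^2\sigma^2 p(x_i)$$
$$\Rightarrow\; \frac{1}{k^2} \ge \sum_{R_1} p(x_i) + \sum_{R_3} p(x_i)$$
Fig. B.3: Chebyshev's theorem, $P(|x_i - \mu| > k\sigma) \le 1/k^2$, with the thresholds $\mu - k\sigma$ and $\mu + k\sigma$ marked.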
This applies to ANY distribution. For $k = 2$, Chebyshev's theorem suggests: $P(|x_i - \mu| > 2\sigma) \le 0.25$. For a Gaussian, $P(|x_i - \mu| > 2\sigma) \approx 0.05$.
Clearly, if the density function is known, we should be using it instead of Chebyshev's theorem, to determine probabilities.
Fig. B.4: Intervals on a Gaussian: $\mu \pm \sigma$ covers 68%, $\mu \pm 2\sigma$ covers 95% and $\mu \pm 3\sigma$ covers 99.7%.
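The gap between the Chebyshev bound and the exact Gaussian tail can be tabulated directly. A minimal sketch, assuming SciPy; `tail_bounds` is an illustrative helper and the values of $k$ are arbitrary.

```python
from scipy.stats import norm

def tail_bounds(k):
    """Return (Chebyshev bound, exact two-sided Gaussian tail) for k standard deviations."""
    chebyshev = 1.0 / k**2          # Eq. B.34, valid for any distribution
    gaussian = 2.0 * norm.sf(k)     # P(|z| > k) for z ~ N(0, 1)
    return chebyshev, gaussian

for k in (1.5, 2.0, 3.0):
    cheb, gauss = tail_bounds(k)
    print(f"k={k}: Chebyshev <= {cheb:.4f}, Gaussian = {gauss:.4f}")
```

For every $k$ the Gaussian tail lies well inside the distribution-free bound, which is why the bound is described as loose.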
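Chebyshev's inequality for the derived variable $\bar{x}$
$$P\left(|\bar{x} - \mu| > k\frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{k^2} \tag{B.35}$$
Let $k\sigma/\sqrt{n} = \epsilon$ = some tolerance.
$$P(|\bar{x} - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \;\Rightarrow\; P(|\bar{x} - \mu| < \epsilon) \ge 1 - \frac{\sigma^2}{n\epsilon^2} \tag{B.36}$$
This is the law of large numbers: for a given tolerance, as $n \to \infty$, $\bar{x} \to \mu$.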
Which estimate intervals do we want?
Chebyshev's inequality gives us a loose bound. We want better estimates. For a normal distribution,
1. interval estimate of $\mu$ with $\sigma^2$ known,
2. interval estimate of $\mu$ with $\sigma^2$ unknown,
3. interval estimate of $\sigma^2$ with $\mu$ unknown.
For a binomial distribution,
4. interval estimate of $p$.
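1. Normal: interval for $\mu$, $\sigma^2$ known
We have $n$ normally distributed points.
$$x_i \sim N(\mu, \sigma^2) \;\Rightarrow\; \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
Recall the $z$ notation for a threshold:
$$P(z > z_\alpha) = \alpha$$
Fig. B.5: One sided interval
Fig. B.6: Two sided intervals
For a two-sided interval,
$$P\left(-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha \tag{B.37}$$
$$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \tag{B.38}$$
We can rearrange the inequalities in Eq. B.37 to get
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{B.39}$$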
So what does a confidence interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ mean? The maximum error of the estimate is $|\bar{x} - \mu| < z_{\alpha/2}\,\sigma/\sqrt{n}$ with probability $1 - \alpha$. What does a confidence interval mean for a Frequentist? The two-sided confidence interval at level $100(1 - \alpha)\%$ was
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{B.40}$$
Fig. B.7: Confidence intervals
We can have lower and upper one-sided intervals.
$$\left(\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}},\; \infty\right) \tag{B.41}$$
$$\left(-\infty,\; \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}\right) \tag{B.42}$$
What should the sample size ($n$) be, if we desire $\bar{x}$ to approach $\mu$ to within a desired level of confidence (i.e. given $\alpha$)? Rearrange and solve for $n$:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{|\bar{x} - \mu|}\right)^2 \tag{B.43}$$
Does the dependency of $n$ on the various terms make sense? You need more samples if
1. you want $\bar{x}$ to come very close to $\mu$,
2. $\sigma$ is large,
3. $z_{\alpha/2}$ is increased (i.e. $\alpha$ is decreased).
Fig. B.8: Intervals on the Gaussian, with error $E = \bar{x} - \mu$ and limits $l = \bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n}$, $u = \bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n}$
Fig. B.9: Effect of $\sigma$
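Eq. B.43 is straightforward to wrap in a helper. The sketch below assumes SciPy; `sample_size_for_mean` is an illustrative name, and rounding up to the next integer is a choice made here since $n$ must be whole. The values $\sigma = 2.0$ and a maximum error of 0.25 are used only as an illustration.

```python
import math
from scipy.stats import norm

def sample_size_for_mean(sigma, max_error, alpha=0.05):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= max_error (Eq. B.43)."""
    z = norm.isf(alpha / 2)                 # upper alpha/2 quantile, 1.96 for alpha = 0.05
    return math.ceil((z * sigma / max_error) ** 2)

# Illustrative values: sigma = 2.0, half-width 0.25 at 95% confidence.
print(sample_size_for_mean(2.0, 0.25))
# Halving the tolerated error roughly quadruples the required sample size.
print(sample_size_for_mean(2.0, 0.125))
```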
2. Normal: interval for $\mu$, $\sigma^2$ unknown
The two-sided interval for $\mu$, when $\sigma^2$ is known, is $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. What happens when $\sigma^2$ is unavailable? We must estimate $\sigma^2$. Its point estimate is $s^2$. However, there's a problem w.r.t. interval estimation:
$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \text{ follows the normal } N(0, 1), \quad\text{but}\quad \frac{\bar{x} - \mu}{s/\sqrt{n}} \text{ does not follow } N(0, 1)$$
In fact, we need a new sampling distribution to describe how $(\bar{x} - \mu)/(s/\sqrt{n})$ varies. We need to acknowledge that our estimate $s^2$ is a random variable (RV), with its own distribution centered around mean $\sigma^2$. Therefore we have a range of values for $s^2$ which could be used to substitute for $\sigma^2$ in $N(\mu, \sigma^2)$ and therefore we must average across several Gaussians, which have the same mean but with different variances. The new distribution must have the following properties:
Symmetry: the Gaussians are symmetric about $\mu$; therefore a weighted average of Gaussians must be symmetric about $\mu$.
Tail: it should have a thick tail for small $n$ (greater variability than $N$). As $n \to \infty$, we should be back to $N(\mu, \sigma^2)$.
The distribution is a function of sample size $n$. We have a family of curves!
William Gosset, a chemist at the Guinness breweries in Dublin, came up with the Student's t-distribution!
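We know of $N(x|\mu, \sigma^2)$. But $\sigma^2$ is unknown, and if we estimate it, we end up with a distribution of variance values. It is easier to work with the precision $\tau$:
$$\tau = \frac{1}{\sigma^2} \tag{B.44}$$
The univariate Normal distribution is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] = \frac{\tau^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\tau}{2}(x - \mu)^2\right] \tag{B.45}$$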
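We have a dependency on $\tau$ of the form
$$\frac{\tau^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\tau}{2}(x - \mu)^2\right]$$
The Gamma distribution is
$$\mathrm{Gam}(x|\alpha, \beta) = \frac{\beta^{\alpha}x^{\alpha - 1}}{\Gamma(\alpha)}\exp(-\beta x) \qquad x > 0,\; \alpha > 0,\; \beta > 0 \tag{B.46}$$
where
$$\Gamma(\alpha) = \int_0^{\infty} e^{-y}y^{\alpha - 1}\,dy \tag{B.47}$$
$\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$. If $\alpha$ is an integer $n$, $\Gamma(n) = (n - 1)!$. For $\alpha = 1$: $\Gamma(1) = 1$, and the Gamma distribution becomes an exponential distribution.
For a Gamma variable, the moment generating function is
$$\phi(t) = E\left[e^{tx}\right] = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_0^{\infty} e^{tx}e^{-\beta x}x^{\alpha - 1}\,dx = \left(\frac{\beta}{\beta - t}\right)^{\alpha} \tag{B.48}$$
$$E[x] = \phi'(0) = \frac{\alpha}{\beta} \tag{B.49}$$
$$\mathrm{Var}[x] = \phi''(0) - (E[x])^2 = \frac{\alpha}{\beta^2} \tag{B.50}$$
The precision follows a Gamma distribution.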
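Fig. B.10: Gamma distribution
We had
$$N(x|\mu, \sigma^2) = N(\mu, \tau^{-1}) \tag{B.51}$$
We want the average of several distributions:
$$E_{\tau}\left[N(\mu, \tau^{-1})\right] \tag{B.52}$$
which is
$$\int_0^{\infty} N\left(x|\mu, \tau^{-1}\right)\mathrm{Gam}(\tau|\alpha, \beta)\,d\tau \tag{B.53}$$
This averaged distribution can be shown to be (details in another handout on Moodle)
$$\frac{\Gamma\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)}\left(\frac{1}{2\pi\beta}\right)^{1/2}\left[1 + \frac{(x - \mu)^2}{2\beta}\right]^{-\alpha - \frac{1}{2}} \tag{B.54}$$
This is the Student's t-distribution, $\mathrm{St}(x|\mu, \alpha, \beta)$, where $2\alpha$ signifies the degrees of freedom.
Fig. B.11: The t distribution
Fig. B.12: The t and N(0, 1) distributions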
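Notice the symmetry about 0 in Figs B.11 and B.12. Quantiles corresponding to the 95% confidence interval for the t-distribution, as a function of the degrees of freedom, are shown in Table B.1. For $n \ge 30$, $t \approx N(0, 1)$.
Fig. B.13: Intervals on the t distribution
$\nu$    $t_{0.025,\nu}$
4    2.776
9    2.262
Now
$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \text{ follows the } N(0, 1) \text{ curve, but } \frac{\bar{x} - \mu}{s/\sqrt{n}} \text{ follows the } t_{n-1} \text{ curve}$$
The interval for $\mu$ when $\sigma^2$ was known, was
$$\bar{x} \pm \frac{\sigma}{\sqrt{n}}z_{\alpha/2}$$
The interval for $\mu$ when $\sigma^2$ is unknown is
$$\bar{x} \pm \frac{s}{\sqrt{n}}t_{\alpha/2,n-1} \tag{B.55}$$
Similarly, lower and upper one-sided intervals are
$$\left(\bar{x} - t_{\alpha,n-1}\frac{s}{\sqrt{n}},\; \infty\right) \tag{B.56}$$
$$\left(-\infty,\; \bar{x} + t_{\alpha,n-1}\frac{s}{\sqrt{n}}\right) \tag{B.57}$$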
Example B.3 A fuse manufacturer states that with an overload, fuses blow in 12.40 minutes on average. As a test, take 20 sample fuses and subject them to an overload. A mean time of 10.63 and a standard deviation of 2.48 minutes is obtained. Is the manufacturer right?
Soln:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{10.63 - 12.40}{2.48/\sqrt{20}} = -3.19$$
From a t table,
$$P(t_{19} > 2.861) = 0.005 \;\Rightarrow\; P(t_{19} < -2.861) = 0.005$$
Hence the manufacturer is wrong and the mean time is probably (with at least 99% confidence) below 12.40 minutes.
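The numbers in Example B.3 can be reproduced without a t table. A short sketch, assuming SciPy; the variable names are illustrative.

```python
import math
from scipy.stats import t as t_dist

# Values quoted in Example B.3.
n, xbar, s, mu0 = 20, 10.63, 2.48, 12.40

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # test statistic on the t_{n-1} curve
p_lower = t_dist.cdf(t_stat, df=n - 1)       # P(t_19 < t_stat), the lower-tail area
t_crit = t_dist.isf(0.005, df=n - 1)         # table value t_{0.005,19}

print(f"t = {t_stat:.2f}, lower-tail p = {p_lower:.4f}, t_0.005,19 = {t_crit:.3f}")
```

The statistic falls below $-t_{0.005,19}$, so the lower-tail probability is smaller than 0.005, in agreement with the conclusion above.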
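3. Normal: interval for $\sigma^2$
If $(x_1, \ldots, x_n) \sim N(\mu, \sigma^2)$, the point estimate of $\sigma^2$ is $s^2$:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \quad\text{and}\quad E\left[s^2\right] = \sigma^2 \tag{B.58}$$
We have
$$z_i = \frac{x_i - \mu}{\sigma} \sim N(0, 1) \tag{B.59}$$
Let
$$X = z_1^2 + \cdots + z_n^2 \tag{B.60}$$
$X$ follows a $\chi^2$ distribution with $n$ degrees of freedom.
$$X \sim \chi^2_n \tag{B.61}$$
The $\chi^2$ with $n$ degrees of freedom is related to the Gamma distribution. The $\chi^2$ distribution has $x$ always $> 0$ and is skewed to the right. It is not symmetric. The skew decreases as $n \to \infty$.
Fig. B.14: The $\chi^2$ distribution, shown for $k$ = 2, 5 and 10 degrees of freedom.
We have
$$\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\sigma^2} \sim \chi^2_n \tag{B.62}$$
When we substitute $\mu$ with $\bar{x}$
$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1} \tag{B.63}$$
But
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \;\Rightarrow\; \frac{(n - 1)s^2}{\sigma^2} \sim \chi^2_{n-1} \;\Rightarrow\; s^2 \sim \frac{\sigma^2}{n - 1}\chi^2_{n-1} \tag{B.64}$$
We compute intervals, using quantiles, as before. No symmetry: we need to evaluate both quantiles for a two-sided interval.
$$\frac{(n - 1)s^2}{\sigma^2} \sim \chi^2_{n-1} \tag{B.65}$$
$$P\left(\frac{(n - 1)s^2}{\chi^2_{\alpha/2,n-1}} < \sigma^2 < \frac{(n - 1)s^2}{\chi^2_{1-\alpha/2,n-1}}\right) = 1 - \alpha$$
$$H_0: \mu = \mu_0$$
$$H_1: \mu = \mu_1 \;(\ne \mu_0)$$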
Do we need to pay attention to $H_1$? Yes. $H_1$ influences the interval estimate. To see that we need to understand what $\alpha$ means. But before that, why should $H_0$ be the equality statement? Because the moment we say something like $H_0: \mu = \mu_0$ we have a specific distribution to work with! The equality statement carries the burden of proof. The legal system prefers that we try and disprove $H_0$: X is innocent. Typically, the accused is innocent ($H_0$) until proven guilty ($H_1$) beyond reasonable doubt ($\alpha$).
Fig. B.16: Hypothesis acceptance
What is $\alpha$?
The error of the estimate is $|\bar{x} - \mu|$. This maximum error is $< z_{\alpha/2}\,\sigma/\sqrt{n}$ with probability $1 - \alpha$.
Fig. B.17: The error
There are 4 possible outcomes to a test of $H_0$ vs. $H_1$.
1. Accept $H_0$ as true, when in fact it is true.
2. Accept $H_0$ as true, when in fact it is not true (i.e. $H_1$ is true).
3. Reject $H_0$ as true, when in fact $H_0$ is true.
4. Reject $H_0$ as true, when in fact $H_0$ is not true (i.e. $H_1$ is true).

Decision taken    Truth: $H_0$                      Truth: $H_1$
$H_0$ accepted    $H_0$ true, $H_0$ accepted        $H_1$ true, $H_0$ accepted
$H_1$ accepted    $H_0$ true, $H_1$ accepted        $H_1$ true, $H_1$ accepted
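Type I error: $H_0$ true, but $H_1$ accepted = $\alpha$. Remember that $\alpha$ = significance level of the test applied. Where there's an $\alpha$, there's a $\beta$.
Type II error: $H_1$ true, but $H_0$ accepted = $\beta$. Assuming $H_1: \mu = \mu_1$, with $\mu_1 > \mu_0$, $1 - \beta$ is the power of the test. $\beta$ depends on $\mu_1$. Increase $|\mu_0 - \mu_1|$ and $1 - \beta$ increases $\Rightarrow$ powerful test.
Fig. B.18: $\alpha$ and $\beta$
We accept $H_0$ in the range
$$\frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}} \le z_{\alpha/2} \;\Rightarrow\; \mu_0 - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu_0 + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{B.85}$$
But $\beta$ is the area in this range, under $H_1$. We need to convert the thresholds to standardized coordinates first, this time under $H_1$. Under $H_1$,
$$\bar{x} \sim N\left(\mu_1, \frac{\sigma^2}{n}\right) \tag{B.86}$$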
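The location $\mu_0 - z_{\alpha/2}\,\sigma/\sqrt{n}$ becomes (under $H_1$)
$$\frac{\mu_0 - z_{\alpha/2}\,\sigma/\sqrt{n} - \mu_1}{\sigma/\sqrt{n}} = \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2} \tag{B.87}$$
Similarly, the other threshold is
$$\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2} \tag{B.88}$$
Fig. B.19: The acceptance interval
Therefore
$$\beta = P\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2} \le z \le \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) = \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) - \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2}\right) \tag{B.89}$$
If $\mu_1 > \mu_0$, then the second term $\approx 0$.
$$\beta \approx \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) = P(z > z_\beta) = P(z < -z_\beta) = \Phi(-z_\beta) \tag{B.90}$$
$$-z_\beta \approx \frac{(\mu_0 - \mu_1)\sqrt{n}}{\sigma} + z_{\alpha/2} \tag{B.91}$$
$$n \approx \frac{(z_{\alpha/2} + z_\beta)^2\sigma^2}{(\mu_1 - \mu_0)^2} \tag{B.92}$$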
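Eqs. B.89 and B.92 can be evaluated together to see how the approximation behaves. The sketch below assumes SciPy; the function names are illustrative, and the values $\mu_0 = 50$, $\mu_1 = 52$, $\sigma = 2$ are assumed for the illustration (the notes do not fix $\sigma$).

```python
import math
from scipy.stats import norm

def beta_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Type II error of the two-sided z-test, Eq. B.89."""
    shift = (mu0 - mu1) / (sigma / math.sqrt(n))
    z = norm.isf(alpha / 2)
    return norm.cdf(shift + z) - norm.cdf(shift - z)

def sample_size(mu0, mu1, sigma, alpha=0.05, beta=0.10):
    """Approximate sample size from Eq. B.92, rounded up."""
    z_a, z_b = norm.isf(alpha / 2), norm.isf(beta)
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2)

# Assumed illustrative values.
n = sample_size(50.0, 52.0, 2.0)
print("n =", n)
print("beta at n     :", beta_two_sided(50.0, 52.0, 2.0, n))
print("beta at n - 1 :", beta_two_sided(50.0, 52.0, 2.0, n - 1))
```

The exact $\beta$ from Eq. B.89 at the returned $n$ should not exceed the requested value, while one fewer sample should miss it, which is a quick way to confirm that dropping the second term in Eq. B.89 is harmless here.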
The sample size depends on
1. $|\mu_0 - \mu_1|$. As $|\mu_0 - \mu_1|$ increases, $n$ decreases.
Fig. B.20: Effect of $|\mu_0 - \mu_1|$. Probability density of $\bar{x}$ under $H_0: \mu = 50$, compared with $H_1: \mu = 52$ and with $H_1: \mu = 50.5$.
2. $\sigma^2$. As $\sigma^2$ increases, $n$ increases.
Fig. B.21: Effect of $\sigma^2$
3. $\alpha$. As $\alpha$ decreases, $z_{\alpha/2}$ increases, hence $n$ increases.
Fig. B.22: Effect of $\alpha$
4. $\beta$. As required power increases ($1 - \beta$ increases), $n$ increases.
Fig. B.23: Effect of $\beta$
One-sided vs. two-sided tests
Consider the null hypothesis $H_0: \mu = \mu_0$. When $H_0: \mu = \mu_0$ and $H_1: \mu > \mu_0$, we should
$$\text{accept } H_0 \text{ if } \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \le z_\alpha \tag{B.93}$$
$$\text{reject } H_0 \text{ if } \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha \tag{B.94}$$
Fig. B.24: One-sided vs. two-sided tests. Acceptance and critical regions on $N(0, 1)$ for $H_1: \mu \ne \mu_0$, $H_1: \mu > \mu_0$ and $H_1: \mu < \mu_0$.
Which is more conservative: a two-sided test or a one-sided test?
p-value
The p-value is the significance level at which no decision can be made between accepting and rejecting $H_0$, i.e. the smallest $\alpha$ at which $H_0$ would be rejected. Find the probability of exceeding the test statistic (in absolute value) on a unit normal.
$$\text{p-value} = P\left(|z| > \frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}}\right) \tag{B.95}$$
Fig. B.25: The p-value
p-value and one-sided tests
For $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$, we would reject if the test statistic $t$ satisfies $|t| > z_{\alpha/2}$.
p-value $= 2\,P(z \ge |t|)$
For $H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$, we would reject if the test statistic $t$ is $> z_\alpha$.
p-value $= P(z \ge t)$
For $H_0: \mu \ge \mu_0$ and $H_1: \mu < \mu_0$, we would reject if the test statistic $t$ is $< -z_\alpha$.
p-value $= P(z \le t)$
Why is it important to report the p-value? Why not just report the test result (yes/no) based on the chosen $\alpha$?
When you perform a hypothesis test, you should report
1. the significance level of the test that you chose,
2. AND the p-value associated with your test statistic.
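A problem solving tip
For $\mu_1 > \mu_0$, and for a two-sided test,
$$\beta \approx \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) \tag{B.96}$$
Observe that
$$\beta = f(|\mu_0 - \mu_1|, \sigma^2, n, \alpha) \tag{B.97}$$
The sample size $n$ is
$$n \approx \frac{(z_{\alpha/2} + z_\beta)^2\sigma^2}{(\mu_1 - \mu_0)^2} \tag{B.98}$$
and depends on
$$n = f(|\mu_0 - \mu_1|, \sigma^2, \alpha, \beta) \tag{B.99}$$
Most problems require us to solve for one of the following in terms of the rest:
$|\mu_0 - \mu_1|$, $\sigma^2$, $\alpha$, $\beta$, $n$
$\beta$, power and rejection of $H_0$
As we increase $\mu_1$, $\beta$ decreases. The curve described by $\beta(\mu_1)$ is called an operating characteristic (OC). For a fixed value of $\alpha$, plot
$$\beta(\mu_1) \text{ vs. } \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}} \tag{B.100}$$
Fig. B.26: The OC curve: the Operating Characteristic curve for a two-sided test with $\alpha = 0.05$, $\beta(\mu_1)$ vs. $|\mu_0 - \mu_1|/(\sigma/\sqrt{n})$.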
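One sided tests and $\beta$
We had, when $H_0: \mu = \mu_0$ and $H_1: \mu > \mu_0$,
$$\beta(\mu_1) = P(\text{accepting } H_0 \mid H_1: \mu = \mu_1 \text{ is correct}) \tag{B.101}$$
We switch to the more general notation, replacing $\mu_1$ with $\mu$
$$\beta(\mu) = P(\text{accepting } H_0 \mid H_1 \text{ is correct}) \tag{B.102}$$
Fig. B.27: Finding $\beta$.
$$\beta(\mu) = P\left(\bar{x} \le \mu_0 + z_\alpha\frac{\sigma}{\sqrt{n}}\right) = P\left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha\right) = P\left(z \le \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha\right), \quad z \sim N(0, 1)$$
$$= \Phi\left(\frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha\right) \tag{B.103}$$
Compare this against $\beta$ for a two-sided interval: $z_{\alpha/2}$ has been replaced with $z_\alpha$, and $\mu_1$ has been replaced with the more generic $\mu$. As its argument increases, $\Phi$ increases. Therefore as $\mu$ increases, $\beta(\mu)$ decreases. The larger $\mu$ is, the less likely it would be to conclude that $\mu = \mu_0$, or for that matter $\mu \le \mu_0$.
$$\beta(\mu_0) = 1 - \alpha$$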
$H_0$ and $H_1$
To test $H_0: \mu = \mu_0$ and $H_1: \mu > \mu_0$, we compared
$$\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \text{ vs. } z_\alpha$$
If instead, we wished to test $H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$, what changes must we make to the test procedure? How did $H_0: \mu = \mu_0$ influence the testing procedure?
We put down the $H_0$ curve centered around $\mu_0$, and then chose $\alpha$ and identified an interval based on it. We meant that the probability of rejection of $H_0$ should be (the small valued) $\alpha$. As we relax $H_0$ to now be $\mu \le \mu_0$ we need to ensure that the probability of rejection of $H_0$ is never greater than $\alpha$. But the probability of accepting $H_0$ for a given $\mu$ is $\beta(\mu)$. Therefore, for the probability of rejection to be never greater than $\alpha$, we would need
$$1 - \beta(\mu) \le \alpha \text{ for all } \mu \le \mu_0$$
which implies that we would need
$$\beta(\mu) \ge 1 - \alpha \text{ for all } \mu \le \mu_0$$
But, as seen before, $\beta(\mu)$ decreases as $\mu$ increases. Therefore $\beta(\mu) \ge \beta(\mu_0)$ for all $\mu \le \mu_0$. But $\beta(\mu_0) = 1 - \alpha$, and therefore we confirm that
$$\beta(\mu) \ge 1 - \alpha \text{ for all } \mu \le \mu_0$$
Fig. B.28: $\alpha$ and $\beta$
Therefore, a test for $H_0: \mu = \mu_0$ at level $\alpha$ also works for $H_0: \mu \le \mu_0$ at level $\alpha$.
B.4 Parametric hypothesis testing
Overview of Parametric Tests discussed
1. Testing the mean of a Normal Population, variance known (z-Test)
2. Testing the mean of a Normal Population, variance unknown (t-Test)
3. Testing the variance of a Normal Population ($\chi^2$-Test)
4. Testing the Binomial proportion
5. Testing the Equality of Means of Two Normal Populations, variances known
6. Testing the Equality of Means of Two Normal Populations, variances unknown but assumed
equal
7. Testing the Equality of Means of Two Normal Populations, variances unknown and found
to be unequal
8. Testing the Equality of Variances of Two Normal Populations (F-Test)
9. Testing the Equality of Means of Two Normal Populations, the Paired t-Test.
The approach for any Parametric Test
Step 1. State the hypothesis.
a. Identify the random variable and the relevant parameter to be tested.
b. Clearly state H0 and H1.
c. Identify the distribution that applies under H0.
Step 2. Identify the significance level ($\alpha$) at this stage.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
a. The relevant (standardized) sampling distribution needs to be identified knowing sample size/degrees of freedom.
b. One-sided or two-sided test?
Step 4. Compute the test statistic from the sampled data.
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
1. The z-Test: One sample test for $\mu$ when the variance is known
Step 1. State the hypothesis.
$H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$.
Under $H_0$, $x \sim N(\mu_0, \sigma^2)$
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
At $\alpha = 0.05$, on the Standard Normal, for a two-sided test, $z_{\alpha/2} = 1.96$.
Step 4. Compute the test statistic from the sampled data.
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \text{ vs. } N(0, 1)$$
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
2. The t-Test: One sample test for $\mu$ when the variance is unknown
Step 1. State the hypothesis.
$H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$.
Under $H_0$, $x \sim N(\mu_0, \sigma^2)$. $\sigma^2$ is unknown. Standardize and use the $t_{n-1}$ curve, with thresholds $\pm t_{\alpha/2,n-1}$.
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
Use $\alpha = 0.05$, on the $t_{n-1}$ curve
Step 4. Compute the test statistic from the sampled data.
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \text{ vs. } t_{n-1}$$
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
Example B.4 12 rats are subjected to forced exercise. Each weight change (in g) is the weight after exercise minus the weight before. Does exercise cause weight loss?
1.7, 0.7, -0.4, -1.8, 0.2, 0.9,
-1.2, -0.9, -1.8, -1.4, -1.8, -2.0
Soln:
Step 1. $H_0: \mu = 0$, $H_1: \mu \ne 0$.
Step 2. Use $\alpha = 0.05$
Step 3. $t_{0.025,11} = 2.201$
Step 4. $n = 12$. $\bar{x} = -0.65$ g and $s^2 = 1.5682$ g$^2$.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = -1.80$$
Step 5. Since $|t| < t_{\alpha/2,n-1}$, $H_0$ cannot be rejected.
Step 6. p-value = ?
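The steps of Example B.4 can be checked with a library t-test. A minimal sketch, assuming NumPy and SciPy; it runs the same two-sided test and also produces the p-value left open in Step 6.

```python
import numpy as np
from scipy import stats

# Weight changes (g) listed in Example B.4.
changes = np.array([1.7, 0.7, -0.4, -1.8, 0.2, 0.9,
                    -1.2, -0.9, -1.8, -1.4, -1.8, -2.0])

xbar = changes.mean()
s2 = changes.var(ddof=1)                       # sample variance with n - 1

# Two-sided one-sample t-test of H0: mu = 0.
result = stats.ttest_1samp(changes, popmean=0.0)
t_stat, p_two_sided = result.statistic, result.pvalue

t_crit = stats.t.isf(0.025, df=len(changes) - 1)   # t_{0.025,11}

print(f"xbar = {xbar:.2f}, s^2 = {s2:.4f}")
print(f"t = {t_stat:.3f}, critical = {t_crit:.3f}, p = {p_two_sided:.4f}")
```

Since the question asks about weight loss specifically, a one-sided alternative ($H_1: \mu < 0$) would also be reasonable; its p-value is half of the two-sided one when the statistic is negative.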
Example B.5 The level of metabolite X measured in 12 patients 24 hours after they received a newly proposed antibiotic was 1.2 mg/dL. If the mean and standard deviation of X in the general population are 1.0 and 0.4 mg/dL, respectively, then, using a significance level of 0.05, test if the level of X in this group is different from that of the general population. What happens if $\alpha = 0.1$?
Soln:
Step 1. $H_0: \mu = 1.0$, $H_1: \mu \ne 1.0$.
Step 2. Use $\alpha = 0.05$
Step 3. $z_{0.025} = 1.96$
Step 4. $n = 12$, $\bar{x} = 1.2$, $\sigma = 0.4$
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = 1.732 \text{ vs. } 1.96$$
Step 5. $H_0$ cannot be rejected at $\alpha = 0.05$.
Step 6. p-value $= 2 \times P(z > 1.732) = 2 \times 0.0416 = 0.0832$
What happens at $\alpha = 0.1$?
Example B.6 (Example B.5 contd.) Suppose the sample standard deviation of metabolite X is 0.6 mg/dL. Assume that the (population) standard deviation of X is not known, and perform the hypothesis test.
Soln:
Step 1. $H_0: \mu = 1.0$, $H_1: \mu \ne 1.0$.
Step 2. Use $\alpha = 0.05$
Step 3. $t_{0.025,11} = 2.201$
Step 4. $n = 12$, $\bar{x} = 1.2$, $s = 0.6$
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = 1.155 \text{ vs. } 2.201$$
Step 5. Cannot reject $H_0$.
Step 6. $P(t < 1.155)$ on a $t_{11}$ curve is 0.8637.
$p = 2 \times (1 - 0.8637) = 0.272$
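Examples B.5 and B.6 differ only in whether $\sigma$ is known, which makes them a convenient side-by-side check. A small sketch, assuming SciPy; variable names are illustrative.

```python
import math
from scipy.stats import norm, t as t_dist

n, xbar, mu0 = 12, 1.2, 1.0

# Example B.5: population sigma known, so use the z-test.
sigma = 0.4
z = (xbar - mu0) / (sigma / math.sqrt(n))
p_z = 2 * norm.sf(abs(z))                    # two-sided p-value on N(0, 1)

# Example B.6: sigma unknown, sample s used instead, so use the t-test.
s = 0.6
t_stat = (xbar - mu0) / (s / math.sqrt(n))
p_t = 2 * t_dist.sf(abs(t_stat), df=n - 1)   # two-sided p-value on t_{11}

print(f"z = {z:.3f}, p = {p_z:.4f}")
print(f"t = {t_stat:.3f}, p = {p_t:.4f}")
print("reject at alpha = 0.05 (z-test)?", p_z < 0.05)
print("reject at alpha = 0.10 (z-test)?", p_z < 0.10)
```

The last two lines address the question posed in Example B.5: the decision changes between $\alpha = 0.05$ and $\alpha = 0.1$ because the p-value lies between them.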
Example B.7 The nationwide average score in an IQ test = 120. In a poor area, 100 student scores are analyzed and mean = 115, and standard deviation is 24. Is the area average lower than the national average?
Soln:
Step 1. $H_0: \mu = 120$ vs. $H_1: \mu < 120$
Step 2. Use $\alpha = 0.05$.
Step 3. $t_{0.05,99} = 1.66$. $z_{0.05} = 1.645$. Use either value.
Since we have $H_1: \mu < 120$, our critical level is $-1.66$
Step 4.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{115 - 120}{24/\sqrt{100}} = -2.08 \text{ vs. } -1.66$$
Step 5. We reject $H_0$ at a significance level of 0.05.
Step 6. The p-value is $P(t_{99} < -2.08) = 0.02$. Result is statistically significant.
Could not have rejected $H_0$ at significance level 0.01.
Example B.8 (Example B.7 contd.) Suppose 10,000 scores were measured with mean = 119 and $s = 24$. Redo the test.
Soln:
Step 1. $H_0: \mu = 120$ vs. $H_1: \mu < 120$
Step 2. Use $\alpha = 0.05$.
Step 3. $t_{0.05,9999} = z_{0.05} = 1.645$.
Since we have $H_1: \mu < 120$, our critical level is $-1.645$
Step 4.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{119 - 120}{24/\sqrt{10000}} = -4.17 \text{ vs. } -1.645$$
Step 5. We reject $H_0$ at a significance level of 0.05.
Step 6. The p-value is $P(t_{9999} < -4.17) = P(z < -4.17) < 0.001$. Result is very statistically significant.
Result is probably not scientifically significant.
Example B.9 A pilot study of a new BP drug is performed for the purpose of planning a larger study. Five patients who have a mean BP of at least 95 mm Hg are recruited for the study and are kept on the drug for 1 month. After 1 month the observed mean decline in BP in these 5 patients is 4.8 mm Hg with a standard deviation of 9 mm Hg.
If $\mu_d$ = true mean difference in BP between baseline and 1 month, then how many patients would be needed to have a 90% chance of detecting a significant difference using a one-tailed test with a significance level of 5%?
Example B.10 (Example B.9 contd.) We go ahead and study the preceding hypothesis based on 20 people (we could not find 31). What is the probability that we will be able to reject $H_0$ using a one-sided test at the 5% level if the true mean and standard deviation of the BP difference is the same as that in the pilot study?
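One way to work Examples B.9 and B.10 is with the one-sided forms of the sample size and power formulas, i.e. with $z_\alpha$ in place of $z_{\alpha/2}$. The sketch below assumes SciPy and, as a simplification, treats the pilot standard deviation of 9 mm Hg as if it were the known $\sigma$ (a normal approximation rather than an exact t-based calculation).

```python
import math
from scipy.stats import norm

# Pilot-study values from Example B.9.
delta, sigma = 4.8, 9.0            # assumed true mean decline and its standard deviation
alpha, power_target = 0.05, 0.90

z_alpha = norm.isf(alpha)               # one-sided threshold
z_beta = norm.isf(1 - power_target)     # beta = 0.10

# One-sided analogue of the sample size formula, rounded up.
n_needed = math.ceil((z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Example B.10: power with only 20 patients, power = 1 - beta(mu).
n_available = 20
power = norm.cdf(delta / (sigma / math.sqrt(n_available)) - z_alpha)

print("patients needed for 90% power:", n_needed)
print(f"power with n = {n_available}: {power:.3f}")
```

Under these assumptions the required sample size matches the figure of 31 mentioned in Example B.10.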
3. The $\chi^2$-Test: One sample test for the variance of a Normal population.
Step 1. State the hypothesis.
$H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 \ne \sigma_0^2$.
Standardize $s^2$ and use the $\chi^2_{n-1}$ curve.
We need both thresholds: $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
Use $\alpha = 0.05$, on the $\chi^2_{n-1}$ curve
Step 4. Compute the test statistic from the sampled data.
$$\frac{(n - 1)s^2}{\sigma_0^2} \text{ vs. } \chi^2_{n-1}$$
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
Example B.11 We desire to estimate the concentration ($\mu$g/mL) of a specific dose of ampicillin in the urine after various periods of time. We recruit 25 volunteers who have received ampicillin and find that they have a mean concentration of 7.0 $\mu$g/mL, with a standard deviation of 2.0 $\mu$g/mL. Assume that the underlying distribution of concentrations is normally distributed.
1. Find a 95% confidence interval for the population mean concentration.
Soln: (6.17, 7.83)
2. Find a 99% confidence interval for the population variance of the concentrations.
Soln: (2.11, 9.71)
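Both intervals in Example B.11 follow from the t and $\chi^2$ quantiles. A minimal sketch, assuming SciPy; note that `isf(q, df)` returns the value with upper-tail area `q`, which matches the $t_{\alpha,n-1}$ and $\chi^2_{\alpha,n-1}$ notation used in these notes.

```python
import math
from scipy.stats import t as t_dist, chi2

# Values quoted in Example B.11.
n, xbar, s = 25, 7.0, 2.0

# 95% interval for the mean: xbar +/- t_{0.025, n-1} * s / sqrt(n)  (Eq. B.55).
half_width = t_dist.isf(0.025, df=n - 1) * s / math.sqrt(n)
mean_ci = (xbar - half_width, xbar + half_width)

# 99% interval for the variance, using both chi-square quantiles.
alpha = 0.01
var_ci = ((n - 1) * s**2 / chi2.isf(alpha / 2, df=n - 1),
          (n - 1) * s**2 / chi2.isf(1 - alpha / 2, df=n - 1))

print("95% CI for the mean    :", mean_ci)
print("99% CI for the variance:", var_ci)
```

The variance interval is not symmetric about $s^2 = 4.0$, reflecting the skew of the $\chi^2$ distribution.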
Example B.12 (Example B.11 contd.) How large a sample would be needed to ensure that the length of the confidence interval in part 1 (for the mean) is 0.5 $\mu$g/mL if we assume that the standard deviation remains at 2.0 $\mu$g/mL?
Soln: n = 246
Example B.13 Iron deficiency causes anemia. 21 students from families below the poverty level were tested and the mean daily iron intake was found to be 12.50 mg with standard deviation 4.75 mg. For the general population, the standard deviation is 5.56 mg. Are the two groups comparable?
Soln: The test statistic $(n - 1)s^2/\sigma_0^2 = 14.6$ lies within (9.591, 34.170) and hence we cannot reject $H_0$.
Soln: Find the p-value yourself.
4. A Test for the Binomial Proportion
One sample test for $p$ assuming that the Normal approximation to the Binomial is valid
Step 1. State the hypothesis.
$H_0: p = p_0$, $H_1: p \ne p_0$.
Under $H_0$, $x \sim N(np_0, np_0(1 - p_0))$.
Standardize and use the $N(0, 1)$ curve.
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
Use $\alpha = 0.05$, on the $z$ curve
Step 4. Compute the test statistic from the sampled data.
$$z = \frac{x - np_0}{\sqrt{np_0(1 - p_0)}} \text{ vs. } z_{\alpha/2}$$
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
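5. A Test for the equality of means of two Normal populations, when the variances are known
Consider measurements $x_i$ and $y_i$ arising out of two different populations. We assume that we have $n$ measurements for $x_i$ and $m$ for $y_i$.
$$x_i \sim N\left(\mu_x, \sigma_x^2\right) \qquad y_i \sim N\left(\mu_y, \sigma_y^2\right) \tag{B.104}$$
Then the point estimates of the means for the two populations are
$$\bar{x} \sim N\left(\mu_x, \frac{\sigma_x^2}{n}\right) \qquad \bar{y} \sim N\left(\mu_y, \frac{\sigma_y^2}{m}\right) \tag{B.105}$$
This therefore implies that (remember that variances of independent variables add up)
$$\bar{x} - \bar{y} \sim N\left(\mu_x - \mu_y, \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}\right) \tag{B.106}$$
Step 1. State the hypothesis.
$H_0: \mu_x - \mu_y = 0$, $H_1: \mu_x - \mu_y \ne 0$.
Under $H_0$, $\bar{x} - \bar{y}$ follows Eq. B.106. Standardize and use the $N(0, 1)$ curve.
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
At $\alpha = 0.05$, on the Standard Normal, for a two-sided test, $z_{\alpha/2} = 1.96$.
Step 4. Compute the test statistic from the sampled data.
$$z = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}} \text{ vs. } N(0, 1), \quad\text{where } \mu_x - \mu_y = 0 \text{ under } H_0$$
If $\sigma_x^2 = \sigma_y^2 = \sigma^2$ (as would be expected when the sample sets are expected to be coming from the same population under $H_0$), then
$$\bar{x} - \bar{y} \sim N\left(\mu_x - \mu_y, \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)\right) \tag{B.107}$$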
The difficulty that remains is identifying the degrees of freedom of the t distribution that needs to be looked up. The recommended approach is to find the degrees of freedom $d''$ as follows:
$$d' = \frac{\left(\frac{S_x^2}{n} + \frac{S_y^2}{m}\right)^2}{\frac{(S_x^2/n)^2}{n - 1} + \frac{(S_y^2/m)^2}{m - 1}} \tag{B.112}$$
$$d'' = d' \text{ rounded down} \tag{B.113}$$
Step 1. State the hypothesis.
$H_0: \mu_x - \mu_y = 0$, $H_1: \mu_x - \mu_y \ne 0$.
Under $H_0$, $\bar{x} - \bar{y}$ follows Eq. B.106. Standardize and use the $t_{d''}$ curve.
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on $H_1$ and $\alpha$ identify the critical threshold(s).
Use $\alpha = 0.05$, on the $t_{d''}$ curve
Step 4. Compute the test statistic from the sampled data using Eq. B.111.
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
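The degrees-of-freedom formula in Eq. B.112 is easy to mistype by hand, so a small helper is useful. This is a sketch in plain Python; `welch_df` is an illustrative name and the sample variances and sizes in the usage line are made-up numbers.

```python
import math

def welch_df(s2_x, n, s2_y, m):
    """Return (d_raw, d_floor): Eq. B.112 and its rounded-down value (Eq. B.113)."""
    a, b = s2_x / n, s2_y / m
    d_raw = (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))
    return d_raw, math.floor(d_raw)

# Hypothetical samples: S_x^2 = 1.0 with n = 5, S_y^2 = 9.0 with m = 20.
d_raw, d_floor = welch_df(1.0, 5, 9.0, 20)
print(f"d' = {d_raw:.3f}, rounded down = {d_floor}")
```

The result always lies between $\min(n, m) - 1$ and $n + m - 2$; with equal variances and equal sample sizes it reduces to $2(n - 1)$, the pooled-variance value.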
8. A Test for the equality of variances of two Normal populations (F-test)
When the population variances for the two Normal populations are unknown, we have (in the previous two tests) developed tests for equality of the means, assuming that the variances are either equal (Case 6) or unequal (Case 7). What we should have been doing, before testing for the equality of the means, was to first ask whether the variances could be assumed equal, especially considering that the methodology in Case 6 is simpler than that in Case 7. From the discussion on the $\chi^2$ distribution, we have a ratio of two $\chi^2$ variables (with, in general, different degrees of freedom), which follows a new distribution, the F distribution, dependent on the two sets of degrees of freedom.
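$$\frac{(n - 1)S_x^2}{\sigma_x^2} \sim \chi^2_{n-1} \qquad \frac{(m - 1)S_y^2}{\sigma_y^2} \sim \chi^2_{m-1} \tag{B.114}$$
$$\frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim \frac{\chi^2_{n-1}/(n - 1)}{\chi^2_{m-1}/(m - 1)} = F_{n-1,m-1} \tag{B.115}$$
Under $H_0: \sigma_x^2 = \sigma_y^2$, we get
$$\frac{S_x^2}{S_y^2} \sim F_{n-1,m-1} \tag{B.116}$$
Then $H_0$ cannot be rejected if
$$F_{1-\alpha/2,n-1,m-1} \le \frac{S_x^2}{S_y^2} \le F_{\alpha/2,n-1,m-1} \tag{B.117}$$
Note that the outcome of the F-test should not depend on which data set is in the numerator of Eq. B.116. The choice of which set is labeled $x$ and $y$ is not important given the following equivalence between two F distributions:
$$P\left(\frac{\chi^2_{n-1}/(n - 1)}{\chi^2_{m-1}/(m - 1)} > F_{\alpha,n-1,m-1}\right) = P\left(\frac{\chi^2_{m-1}/(m - 1)}{\chi^2_{n-1}/(n - 1)} < \frac{1}{F_{\alpha,n-1,m-1}}\right) = \alpha \tag{B.118}$$
$$P\left(\frac{\chi^2_{m-1}/(m - 1)}{\chi^2_{n-1}/(n - 1)} \ge \frac{1}{F_{\alpha,n-1,m-1}}\right) = 1 - \alpha = P\left(\frac{\chi^2_{m-1}/(m - 1)}{\chi^2_{n-1}/(n - 1)} > F_{1-\alpha,m-1,n-1}\right) \tag{B.119}$$
which shows that
$$\frac{1}{F_{\alpha,n-1,m-1}} = F_{1-\alpha,m-1,n-1} \tag{B.120}$$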
The F-test is summarized as follows:
Step 1. State the hypothesis.
H0 : $\sigma_x^2 = \sigma_y^2$, H1 : $\sigma_x^2 \neq \sigma_y^2$.
Under H0, we expect Eq. B.116.
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on H1 and $\alpha$, identify the critical threshold(s).
Use $\alpha = 0.05$, on the $F_{n-1,m-1}$ curve.
Step 4. Compute the test statistic, $S_x^2/S_y^2$, from the sampled data.
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
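The F-test steps can be sketched with the standard library alone. The code below is only an illustration under stated assumptions: the F distribution function is obtained by Simpson integration of the F density instead of a statistical library routine, both degrees of freedom are assumed to exceed 2, the two-sided p-value is taken as twice the smaller tail area, and all names and data are hypothetical.

```python
import math
import statistics


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, assuming d1 > 2."""
    if x <= 0.0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = (0.5 * d1 * math.log(d1) + 0.5 * d2 * math.log(d2)
               + (0.5 * d1 - 1.0) * math.log(x)
               - 0.5 * (d1 + d2) * math.log(d1 * x + d2) - log_beta)
    return math.exp(log_pdf)


def f_cdf(x, d1, d2, intervals=2000):
    """P(F <= x) by composite Simpson integration on [0, x]; intervals must be even."""
    h = x / intervals
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3.0


def f_test(x, y):
    """Return (S_x^2 / S_y^2, two-sided p-value) for H0: equal variances."""
    n, m = len(x), len(y)
    ratio = statistics.variance(x) / statistics.variance(y)
    lower_tail = f_cdf(ratio, n - 1, m - 1)
    return ratio, 2.0 * min(lower_tail, 1.0 - lower_tail)


# Hypothetical data, for illustration only
x = [4.1, 5.3, 6.0, 4.8, 5.5, 6.2, 5.1, 4.6]
y = [5.0, 7.2, 3.9, 6.8, 4.4, 8.1, 5.9, 3.5, 6.6, 7.4]
ratio_xy, p_xy = f_test(x, y)
ratio_yx, p_yx = f_test(y, x)
print(f"x/y: ratio = {ratio_xy:.3f}, p = {p_xy:.4f}")
print(f"y/x: ratio = {ratio_yx:.3f}, p = {p_yx:.4f}")
```

Swapping the labels inverts the ratio but leaves the p-value unchanged up to integration error, which is the content of Eq. B.120.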
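Sample sizes for two sample tests of the means
Consider H0 : $\mu_x = \mu_y$ and H1 : $\mu_x \neq \mu_y$. Under H0, the lower critical threshold is (in standardized coordinates) $-z_{\alpha/2}$. In the original coordinate system, relative to the H0 curve, this threshold is
$$E_{H_0}[\bar{X} - \bar{Y}] - z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}} = -z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}$$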
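But this coordinate, relative to the H1 curve, is known to be (after standardization) $z_\beta$:
$$z_\beta = \frac{-z_{\alpha/2}\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}} \tag{B.121}$$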
This can be rearranged to get, using k = m/n,
$$\left(\frac{z_{\alpha/2} + z_\beta}{\mu_x - \mu_y}\right)^2 = \frac{n}{\sigma_x^2 + \sigma_y^2/k} \tag{B.122}$$
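and hence, given $\alpha$ and $\beta$, the sample sizes required to differentiate between two populations
are
$$n = \left(\frac{z_{\alpha/2} + z_\beta}{\mu_x - \mu_y}\right)^2 \left(\sigma_x^2 + \frac{\sigma_y^2}{k}\right) \tag{B.123}$$
$$m = nk \tag{B.124}$$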
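Equations B.123 and B.124 are easy to evaluate numerically. The sketch below assumes that $z_{\alpha/2}$ and $z_\beta$ are upper-tail points of the standard Normal, that both variances are known, and that the result is rounded up to whole samples; the function name and the default values of $\alpha$ and $\beta$ are illustrative choices, not part of the derivation.

```python
import math
from statistics import NormalDist


def two_sample_sizes(delta, sigma_x, sigma_y, alpha=0.05, beta=0.20, k=1.0):
    """Return (n, m) from Eqs. B.123 and B.124 for a mean difference delta."""
    z_half_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point
    z_beta = NormalDist().inv_cdf(1.0 - beta)               # upper beta point
    n = ((z_half_alpha + z_beta) / delta) ** 2 * (sigma_x ** 2 + sigma_y ** 2 / k)
    n = math.ceil(n)               # round up to a whole number of samples
    return n, math.ceil(n * k)     # Eq. B.124


# Hypothetical inputs: detect a difference of half a standard deviation
n, m = two_sample_sizes(delta=0.5, sigma_x=1.0, sigma_y=1.0)
print(n, m)
```

Because the difference enters as $1/(\mu_x - \mu_y)^2$, halving the difference to be detected roughly quadruples the required sample sizes.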
9. A Test for the equality of means of two Normal populations, when the samples are paired (the paired t-test)
Consider n pairs of before and after measurements, which reflect, for example, the efficacy
of some treatment. Then the n differences $d_i$ can be used to compute the mean sample
difference on treatment (= $\bar{d}$), and the sample standard deviation sd($d_i$). We have one vector
of difference values, with the average difference expected to be zero, according to H0. Since
the population variance is unknown, we have a t test with n − 1 degrees of freedom.
$$\frac{\bar{d} - E[\bar{d}]}{\mathrm{sd}(\bar{d})} = \frac{\bar{d} - E[\bar{d}]}{\mathrm{sd}(d_i)/\sqrt{n}} \sim t_{n-1} \tag{B.125}$$
Then, the paired t-test may be summarized as follows.
Step 1. State the hypothesis.
H0 : $E[\bar{d}] = \delta = 0$, H1 : $\delta \neq 0$.
Under H0, we expect Eq. B.125 to hold.
Step 2. Identify the significance level ($\alpha$) at this stage.
We set $\alpha = 0.05$.
Step 3. Based on H1 and $\alpha$, identify the critical threshold(s).
Use $\alpha = 0.05$, on the $t_{n-1}$ curve.
Step 4. Compute the test statistic, $\sqrt{n}\,\bar{d}/\mathrm{sd}(d_i)$, from the sampled data.
Step 5. Identify the test result given the critical level (based on $\alpha$).
Step 6. Report the test result, and the p-value.
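A minimal sketch of Step 4 of the paired t-test follows. The before and after values are hypothetical, and the function returns only the statistic and its degrees of freedom; the critical value or p-value would still come from the $t_{n-1}$ curve.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    d = [a - b for a, b in zip(after, before)]  # the n differences d_i
    n = len(d)
    # sqrt(n) * mean(d) / sd(d_i), with sd using the n - 1 denominator
    t = math.sqrt(n) * statistics.mean(d) / statistics.stdev(d)
    return t, n - 1


# Hypothetical before/after measurements, for illustration only
before = [10, 12, 9, 11, 13]
after = [11, 14, 12, 15, 18]
t, dof = paired_t_statistic(before, after)
print(f"t = {t:.3f} on {dof} degrees of freedom")
```

Working with the differences turns the two-sample problem into a one-sample test of a zero mean, which is why only one standard deviation appears.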
B.5 Nonparametric hypothesis testing
When the data available to us does not in any fashion obviously follow some distribution, or when the data is ordinal in nature rather than cardinal, the parametric hypothesis tests previously described cannot be applied. We then resort to
nonparametric hypothesis tests. Two essential features of these tests are that we pay attention
to the median of a data set, rather than the arithmetic mean, and that we endeavor to identify a
statistic, like a proportion, or a rank, which can be expected to follow a parametric distribution.
In this manner, we drag things back towards the more convenient parametric hypothesis tests.
We review below three nonparametric tests which will be seen to be counterparts of some of
the two sample parametric tests already discussed.
1. The Wilcoxon sign test.
2. The Wilcoxon signed rank test.
3. The Wilcoxon rank sum test.
The Wilcoxon sign test
The paired t-test is applied under the assumption that all measurements follow some Normal distribution. If the underlying distribution is not Normal, or if it is unknown, then the sign test is a safer test.
The Wilcoxon sign test is the nonparametric counterpart to the paired t test, and is usually
invoked to evaluate whether two data sets are similar. Let $d_i$ indicate the relative like or
dislike for two categories (ice creams). Then we have 3 categories of data: $d_i < 0$ where flavor
A is preferred, $d_i > 0$ with B being preferred, and $d_i = 0$ with equal preference for both
flavors. We wish to determine whether there is a marked preference for one flavor versus
the other. H0 would suggest an equal number of people preferring the two ice creams. The
actual numbers in the survey voting for the two ice creams should turn out to be close. Note
that those who cannot make up their minds, with $d_i = 0$, are irrelevant to the test of H0.
Therefore, if n scores have $d_i \neq 0$, of which c have $d_i > 0$, the test itself will be to see if c is close enough to the median of the data set, n/2. If $\theta = E[c]$ is the population median of the underlying distribution of $d_i$, then
H0 : $E[c] = \theta = n/2$
H1 : $\theta \neq n/2$
Under H0, the binomial proportion of finding nonzero $d_i > 0$ is 1/2. Further, the Normal approximation to the Binomial is assumed to require Var[c] = np(1 − p) ≥ 10, and under H0, since p = 1/2, we get n ≥ 40. Then we reject H0 if
$$\frac{|c - E[c]|}{\mathrm{sd}(c)} = \frac{|c - n/2|}{\sqrt{n/4}} > z_{\alpha/2} \tag{B.126}$$
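A continuity correction is required because we are shifting over from a discrete to a continuous
distribution. H0 is rejected if
$$\text{If } c > \frac{n}{2}: \quad \frac{c - \frac{n}{2} - \frac{1}{2}}{\sqrt{n/4}} > z_{\alpha/2}$$
$$\text{If } c < \frac{n}{2}: \quad \frac{\left|c - \frac{n}{2} + \frac{1}{2}\right|}{\sqrt{n/4}} > z_{\alpha/2} \tag{B.127}$$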
Example B.14 Two flavours of ice cream undergo a taste test. 45 people are asked to choose,
and 22 prefer A, 18 prefer B, and 5 cannot make up their minds. Is one flavour preferred over
the other?
Soln:
n = 40, c = 18, the statistic is
$$\frac{(18 - 20) + (1/2)}{\sqrt{40/4}} = \frac{-1.5}{\sqrt{10}} = -0.47$$
The p-value is $2 \times (1 - \Phi(0.47)) = 2 \times (1 - 0.6808) \approx 0.64$. Hence H0 cannot be rejected according to this test.
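The arithmetic of Example B.14 can be checked with a short sketch of Eq. B.127. The function name is hypothetical, and the Normal approximation is assumed to be adequate (n ≥ 40, as discussed above).

```python
import math
from statistics import NormalDist


def sign_test(num_positive, num_negative):
    """Return (z, two-sided p-value) using the continuity-corrected Normal approximation."""
    n = num_positive + num_negative  # zero differences are excluded beforehand
    c = num_positive
    if c > n / 2:
        z = (c - n / 2 - 0.5) / math.sqrt(n / 4)
    elif c < n / 2:
        z = (c - n / 2 + 0.5) / math.sqrt(n / 4)
    else:
        z = 0.0  # c sits exactly at n/2: no evidence against H0
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Example B.14: 18 people prefer B (d_i > 0), 22 prefer A (d_i < 0)
z, p = sign_test(num_positive=18, num_negative=22)
print(f"z = {z:.2f}, p = {p:.2f}")  # z = -0.47, p = 0.64
```

The 5 undecided tasters never enter the calculation, in line with the remark that $d_i = 0$ is irrelevant to the test of H0.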
The Wilcoxon signed rank test
Rather than simply resort to the proportions employed in the Sign test, we bring some more
detail into the hypothesis test, by ordering the measurements and assigning ranks to them.
We then expect the ranks to be equally distributed between corresponding quantile groups
on either side of the median. This therefore ends up testing whether the distribution function
has similar shapes on either side of the median, as would be expected of H0 where the two
groups are supposed to be identical. The procedure is as follows:
1. Arrange the $d_i$'s in order of absolute value and get frequency counts.
2. As in the sign test, ignore the $d_i = 0$ observations.
3. Rank the remaining observations from 1 to n.
4. Let $R_1$ be the sum of ranks for group 1.
5. H0 assumes that the ranks are equally shared between the two groups, and H1 assumes
otherwise. Since n ranks are given out, the sum of all ranks is n(n + 1)/2, and hence the rank sum for group 1 should be n(n + 1)/4.
$$E[R_1] = \frac{n(n+1)}{4}, \qquad \mathrm{Var}[R_1] = \frac{n(n+1)(2n+1)}{24} \tag{B.128}$$
6. For n ≥ 16, we assume that the rank sum follows a Normal distribution
$$R_1 \sim N\left(\frac{n(n+1)}{4},\; \frac{n(n+1)(2n+1)}{24}\right) \tag{B.129}$$
7. The test statistic T, bringing in the continuity correction again, is
$$T = \frac{\left|R_1 - \frac{n(n+1)}{4}\right| - \frac{1}{2}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} \tag{B.130}$$
8. A further correction factor is required for the variance, if there are subgroups with tied
ranks. The variance calculated above will be an overestimate. If g indicates the number of
tied groups, and $t_i$ the number of tied values in the i-th group, then the corrected variance is
$$\mathrm{Var}[R_1] = \frac{n(n+1)(2n+1)}{24} - \sum_{i=1}^{g} \frac{t_i^3 - t_i}{48} \tag{B.131}$$
Equation B.130 needs to be correspondingly updated.
Example B.15 (Example B.14 contd.) The 45 volunteers rate each ice cream on a scale
from 1 to 10 (data in the table below; 10 = really really like it, 1 = don't care). Let $d_i$ =
score(B) − score(A). Then $d_i > 0$ implies that B is preferred.

|d_i|    f_i (d_i < 0)    f_i (d_i > 0)
10       0                0
9        0                0
8        1                0
7        3                0
6        2                0
5        2                0
4        1                0
3        5                2
2        4                6
1        4                10
Soln:

|d_i|    f_i (d_i < 0)    f_i (d_i > 0)    Total # of people    Range of ranks    Average rank
10       0                0                0
9        0                0                0
8        1                0                1                    40                40
7        3                0                3                    37-39             38
6        2                0                2                    35-36             35.5
5        2                0                2                    33-34             33.5
4        1                0                1                    32                32
3        5                2                7                    25-31             28.0
2        4                6                10                   15-24             19.5
1        4                10               14                   1-14              7.5
We conveniently choose R1 to be the group corresponding to di > 0 since it has fewer nonzero
frequencies (the arithmetic becomes simpler).
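$$R_1 = 10(7.5) + 6(19.5) + 2(28.0) = 248$$
$$E[R_1] = \frac{40 \times 41}{4} = 410$$
$$\mathrm{Var}[R_1] = \frac{40 \times 41 \times 81}{24} - \frac{(14^3 - 14) + (10^3 - 10) + (7^3 - 7) + (3^3 - 3) + 2\,(2^3 - 2)}{48} = 5449.75$$
$$\mathrm{sd}(R_1) = \sqrt{5449.75} = 73.82$$
$$T = \frac{|248 - 410| - (1/2)}{73.82} = 2.19$$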
The p-value is $2 \times (1 - \Phi(2.19)) = 2 \times (1 - 0.9857) \approx 0.03$ and hence H0 is rejected, likely because flavor A is preferred. Notice that by including the quantile level information, the signed rank test gives a different conclusion to that of the Sign
test. Since this test has involved more detailed analysis, the conclusion here is preferable to
that in Example B.14. Observe, from the method employed, that this test is the nonparametric
equivalent of the paired t test.
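The numbers in Example B.15 can be reproduced with a small sketch of the signed rank procedure, including the tie correction of Eq. B.131. The function name and the way the frequency table is expanded into individual differences are illustrative assumptions; the Normal approximation (n ≥ 16) is taken for granted.

```python
import math
from collections import Counter
from statistics import NormalDist


def signed_rank_test(differences):
    """Return (R1, Var[R1], T, p) where R1 is the rank sum of the positive d_i."""
    d = [v for v in differences if v != 0]   # ignore d_i = 0
    n = len(d)
    counts = Counter(abs(v) for v in d)      # sizes of the tied groups of |d_i|
    avg_rank, start = {}, 1
    for value in sorted(counts):
        t = counts[value]
        avg_rank[value] = start + (t - 1) / 2  # mean of ranks start .. start + t - 1
        start += t
    r1 = sum(avg_rank[abs(v)] for v in d if v > 0)
    expected = n * (n + 1) / 4
    # Eq. B.131: untied groups (t = 1) contribute zero to the correction
    variance = n * (n + 1) * (2 * n + 1) / 24 - sum(t ** 3 - t for t in counts.values()) / 48
    t_stat = max(abs(r1 - expected) - 0.5, 0.0) / math.sqrt(variance)
    p_value = 2.0 * (1.0 - NormalDist().cdf(t_stat))
    return r1, variance, t_stat, p_value


# Frequencies from Example B.15: |d_i| -> (count with d_i < 0, count with d_i > 0)
freq = {8: (1, 0), 7: (3, 0), 6: (2, 0), 5: (2, 0),
        4: (1, 0), 3: (5, 2), 2: (4, 6), 1: (4, 10)}
diffs = []
for size, (neg, pos) in freq.items():
    diffs += [-size] * neg + [size] * pos

r1, var, t_stat, p = signed_rank_test(diffs)
print(f"R1 = {r1}, Var = {var}, T = {t_stat:.2f}, p = {p:.2f}")
```

Choosing the positive differences as group 1 matches the worked solution; using the negative group instead gives the same T, because the two rank sums are equidistant from n(n + 1)/4.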
The Wilcoxon rank sum test
(aka the Mann-Whitney test, aka the U test)
The rank sum test is the nonparametric equivalent of the t-test for 2 independent samples.
The procedure involves assigning ranks after combining the data from the two groups. As
with the signed rank test, the rank sum for a group is compared with the expected rank sum.
Let $R_1$ represent the rank sum in the first group, and let there be $n_1$ and $n_2$ samples in the two groups. Then, the rank sum of all the ranks is $(n_1 + n_2)(n_1 + n_2 + 1)/2$ and hence the
average rank for the combined sample is $(n_1 + n_2 + 1)/2$. Under H0,
$$E[R_1] = n_1\,\frac{n_1 + n_2 + 1}{2}, \qquad \mathrm{Var}[R_1] = n_1 n_2\,\frac{n_1 + n_2 + 1}{12} \tag{B.132}$$
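If $n_1$ and $n_2$ are ≥ 10, the rank sum may be assumed to have an underlying continuous (Normal) distribution:
$$R_1 \sim N\left(\frac{n_1(n_1 + n_2 + 1)}{2},\; \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}\right) \tag{B.133}$$
and hence the appropriate test statistic is
$$T = \frac{\left|R_1 - \frac{n_1(n_1 + n_2 + 1)}{2}\right| - \frac{1}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \tag{B.134}$$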
Once again a correction factor needs to be applied to correct for groups of tied values. If g
indicates the number of tied groups, and ti the number of tied values in that group, then the
corrected variance is
Var [R1] =n1n2(n1 + n2 + 1 S)
12S =
gi=1
t3i ti(n1 + n2)(n1 +