Univariate Stats


Appendix A: Introduction to Probability and Statistics

    A.1 Introduction

"If you've heard this story before, don't stop me, because I'd like to hear it again." — Groucho Marx

Many questions have probabilistic solutions. For example: are two objects/motifs similar because of some relationship, or is it coincidence? How do you classify similar items? How surprising is the occurrence of an unusual event? How do we describe probabilistic outcomes using models? What do we do when an exact model is not available? Which hypothesis/model is more relevant/accurate/believable?

Before we talk about models, we need to know about discrete and continuous random variables, and the testing of hypotheses. Random experiments must satisfy the following:

1. All possible discrete outcomes must be known in advance.

2. The outcome of a particular trial is not known in advance, and

3. The experiment can be repeated under identical conditions.

A sample space S, the set of all outcomes, is associated with a random experiment.

Example A.1 S for a coin toss is {H, T}, and for a die roll it is {1, 2, 3, 4, 5, 6}. An event A is a subset of S, and we are usually interested in whether A happens or not, e.g. A = odd value on the roll of a die; A = H on tossing a coin.

The random variable X is some function mapping outcomes in the space S to the real line, X : S → ℝ. E.g. with two coin tosses, S = {(H, T), (T, H), (H, H), (T, T)}; if X = number of heads turning up, then X(H, T) = 1, X(H, H) = 2, etc.

Random variable: a real-valued function defined on the outcomes in a sample space. This function could have discrete or continuous values. For example:

1. The sum of two rolls of a die, which ranges from 2 to 12. (The individual rolls are i.i.d.: independent and identically distributed.)

2. The sequence HHTH of coin tosses is not a random variable: it must have a numerical mapping.

A function of a random variable defines another random variable. Measures such as mean and variance can be associated with a random variable. A random variable can be conditioned on an event, or on another random variable. The notion of independence is important: a random variable may be independent of an event or of another random variable. (Random variable = no control over the outcome.)

Statistics refers to drawing inferences about specific phenomena (usually a random phenomenon) on the basis of a limited sample size. We are interested in descriptive statistics: we need measures of location and measures of spread.

Data scales are classified as categorical, ordinal or cardinal. Categorical scale: a label per individual (win/lose etc.). If the label is a name, we refer to the data as nominal data. Ordinal data can be ranked, but no arithmetic transformation is meaningful, e.g. vision (20-20, 20-30), grades etc. Cardinal data is available in interval or ratio scales. Ratio scale: the zero point is known (e.g. a ratio of absolute temperatures is meaningful in K but not in °C; the same holds for mass flow ratios). Interval scale (arbitrary zero): °F, °C. (We mostly use cardinal data.)

Cardinal data may be sorted and ranked: information may be lost in the process, but analysis


may be easier. Consider the following transformation: if 4 individuals A, B, C and D have heights (in inches) 72, 60, 68 and 76 respectively, they may be transformed to ranks: 2, 4, 3 and 1. The height difference between B and A (12 in) is not the same as that between B and C (8 in), but rank differences do not preserve this, and therefore such transformations may lead to incorrect inferences. Ordinal data are protected by monotonic transforms: if x > y, then log(x) > log(y); however |x − y| ≠ |log(x) − log(y)| in general.

    A.2 Discrete random variables

Measures of location

Fig. A.1: Translating variables on a linear scale (yᵢ = xᵢ vs. yᵢ = xᵢ + c).

Mean

The sample mean is x̄ and the population mean is μ:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{A.1}$$

where N = size of the total population and μ is usually unknown. The arithmetic mean can be translated or rescaled.

Translation: if c is added to every xᵢ, i.e. yᵢ = xᵢ + c, then ȳ = x̄ + c (Fig. A.1).

Scaling: if yᵢ = cxᵢ, then ȳ = cx̄.

    Median

For n observations, the sample median is the ((n + 1)/2)th largest observation for odd n, and the average of the (n/2)th and (n/2 + 1)th largest observations for even n.

    Mode

    Most frequently occurring value. Can be problematic if each xi is unique.

Geometric mean

$$\bar{x}_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \quad\text{or, on a log scale,}\quad \bar{x}_G = \operatorname{antilog}\left[\frac{1}{n}\sum_{i=1}^{n}\log x_i\right]$$

Harmonic mean

$$\frac{1}{\bar{x}_H} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}$$

For positive data, x̄_H < x̄_G < x̄.
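The three means are easy to check numerically. Below is a minimal Python sketch (the data vector is hypothetical; NumPy is assumed available) that computes them and verifies the ordering x̄_H < x̄_G < x̄ for positive data.

```python
import numpy as np

x = np.array([72.0, 60.0, 68.0, 76.0])   # hypothetical positive data

arith = x.mean()
geom = np.exp(np.log(x).mean())          # antilog of the mean log
harm = len(x) / np.sum(1.0 / x)

print(arith, geom, harm)                 # arithmetic > geometric > harmonic
assert harm < geom < arith
```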

Display of data

Pareto diagram

Dot diagram

Fig. A.2: The dot diagram.

Given n = {3, 6, 2, 4, 7, 4, 3}, the dot diagram is shown in Figure A.2. This simple plot can be useful when there is a lot of data. From a visual inspection of the plot, 2 could be considered an outlier. Another advantage of these plots is that they can be superimposed.

Frequency Distributions

Fig. A.3: A frequency distribution.

A frequency distribution is shown as a histogram in Figure A.3. Frequencies for each class/category are tallied (sometimes using the # notation). Endpoints can be conveniently represented using bracket notation: in [1, 10), 1 is included and 10 is not; in [10, 19), 10 is included when tallying. Grouping data together loses information (in the form of specific individual values). Cumulative distributions can also be generated from the frequency curves. If the height of a


rectangle in a histogram is equal to the ratio (relative frequency)/(class width), then the total area = 1 and we have a density histogram.

Stem and leaf displays

Stem | Leaves
  1  | 2 7 5
  2  | 9 1 5 4 8
  3  | 4 9 2 4
  4  | 4 8 2
  5  | 3

i.e. the data are 12 17 15; 29 21 25 24 28; 34 39 32 34; 44 48 42; 53.

    Measures of spread

    Range

The range is the difference between the largest and smallest observations in a sample set. It is easy to compute but is sensitive to outliers.

Quantiles/percentiles

The pth percentile is that threshold such that p% of observations are at or below this value. Therefore it is:

the (k + 1)th largest sample point if np/100 is not an integer (k = largest integer less than np/100);
the average of the (np/100)th and (np/100 + 1)th largest values if np/100 is an integer.

For the 10th percentile, p/100 = 0.1. The median = 50th percentile. The layout of quantities which the calculation of percentiles requires gives information about the spread/shape of the distribution. However, the data must be sorted.

Let Q1 = 25th percentile, Q2 = 50th percentile = median, and Q3 = 75th percentile.

Fig. A.4: A box plot.

To find the pth percentile: (1) order the n observations (from small to large), (2) determine np/100, (3) if np/100 is not an integer, round up, (4) if np/100 is an integer = k, average the kth and (k + 1)th values.

The range (max value − min value) was strongly influenced by outliers. The interquartile range = Q3 − Q1 covers the middle half of the data and is less affected by outliers. The boxplot (Fig. A.4) is a graphical depiction of these ranges. A box is drawn from Q1 to Q3, the median is drawn as a line in the box, and outer lines are drawn either up to the outermost points, or to 1.5 × (box width) = 1.5 × (interquartile range).
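A minimal sketch of the percentile recipe above (pure Python; the data are hypothetical). Note that library routines such as numpy.percentile interpolate by default, so small differences from this textbook rule are expected.

```python
def percentile(data, p):
    """pth percentile, textbook rule: sort, compute np/100,
    round up if fractional, else average the k-th and (k+1)-th values."""
    xs = sorted(data)
    n = len(xs)
    pos = n * p / 100.0
    if pos != int(pos):
        return xs[int(pos)]              # ceil(pos)-th value, 1-indexed
    k = int(pos)
    return 0.5 * (xs[k - 1] + xs[k])     # average k-th and (k+1)-th values

data = [2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)                        # 3.5 5.5 7.5
```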

Deviation

The sum of deviations from the mean, $d = \sum_{i=1}^{n}(x_i - \bar{x})/n = 0$, is not useful. The mean deviation is $\sum_{i=1}^{n}|x_i - \bar{x}|/n$. This metric does measure spread, but does not accurately reflect a bell shaped distribution.

Variance and Standard deviation

A variance may be defined as

$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \qquad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \tag{A.2}$$

where N denotes the size of a large population. The sample variance S² uses (n − 1) instead of n in the denominator: there are only (n − 1) independent deviations if x̄ is given, since their sum is constrained to be 0. This definition of S² with (n − 1) is used to better represent the population variance σ². A large S² implies large variability. S² is always ≥ 0.

The definition of S² above requires that x̄ be computed first and then S²: two passes of the data are required. A more convenient form (given below) permits simultaneous computation of $\sum x_i^2$ and $\sum x_i$, and thus requires one pass.

Fig. A.5: Effect of scaling on variance (yᵢ = xᵢ vs. yᵢ = cxᵢ).

$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2/n}{n-1} \tag{A.3}$$

For translated data, yᵢ = xᵢ + c (i = 1 to n): ȳ = x̄ + c and S²ᵧ = S²ₓ. For scaled samples, yᵢ = cxᵢ for i = 1, ..., n with c > 0: S²ᵧ = c²S²ₓ (Fig. A.5).

Coefficient of variation

$$CV = \frac{S}{\bar{x}} \times 100\% \tag{A.4}$$

This metric is dimensionless, hence one can discuss S relative to the mean's magnitude. This is useful when comparing different sets of samples with different means, with larger means usually having higher variability.

Indices of diversity

Such indices are used for nominal scaled data where mean/median do not make sense, e.g. to describe diversity among groups. Uncertainty is a measure of diversity. For k categories and pᵢ = fᵢ/n = proportion of observations in a category,

$$H = -\sum_{i=1}^{k} p_i \log p_i = \frac{n\log n - \sum_{i=1}^{k} f_i \log f_i}{n} \tag{A.5}$$

(More on uncertainty later. For now, notice the connection to entropy!)

H_max = log k. Shannon's index J = H/H_max.
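A minimal sketch of (A.5) for hypothetical category counts fᵢ, verifying that the two forms of H agree:

```python
import math

f = [40, 30, 20, 10]                      # hypothetical category counts
n = sum(f)
p = [fi / n for fi in f]

H = -sum(pi * math.log(pi) for pi in p)   # Shannon uncertainty
H_alt = (n * math.log(n) - sum(fi * math.log(fi) for fi in f)) / n
J = H / math.log(len(f))                  # Shannon's index

print(H, H_alt, J)                        # the two forms of H agree
```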

Probability mass function (probability distribution)

The assignment of a probability to the value r that a random variable X takes is represented as Pr_X(X = r). E.g. for two coin tosses with X = number of heads:

$$Pr_X(x) = \begin{cases} 1/2 & x = 1 \\ 1/4 & x = 0 \text{ or } x = 2 \\ 0 & \text{otherwise} \end{cases}$$

Obviously $0 \le Pr(X = r) \le 1$ and $\sum_{r \in S} Pr(X = r) = 1$. (Notation: we will use P(X = r) instead of Pr_X(X = r).)

A probability distribution is a model based on infinite sampling, identifying the fraction of outcomes/observations in a sample that should be allocated to a specific value. Consider a series of 10 coin tosses where the outcome = 4 heads. A probability distribution can tell us how often 4 heads crop up when the 10 coin toss experiment is repeated 1000 times.

To establish the probability that a drug works on {0, 1, 2, 3, 4} out of 4 patients treated, theoretical calculations (from some distribution) are compared with frequency counts generated from 100 samples.

r | P(X = r) | Frequency distribution (100 samples)
0 | 0.008    | 0.000 [0/100]
1 | 0.076    | 0.090 [9/100]
2 | 0.265    | 0.240 [24/100]
3 | 0.411    | 0.480 [48/100]
4 | 0.240    | 0.190 [19/100]

Compare the two distributions: does clinical practice match expectations from experience? Given data and some understanding of the process, you can apply one of many known PMFs to generate the middle column above.


Expectation of a discrete random variable

The expectation of X is denoted

$$E[X] = \sum_{i=1}^{R} x_i P(x_i) \tag{A.6}$$

where xᵢ = value i of the random variable. E.g. for the probability that a drug works on {0, 1, 2, 3, 4} out of four patients (table below): E[X] = 0(0.008) + 1(0.076) + 2(0.265) + 3(0.411) + 4(0.240) = 2.80. Therefore 2.8 out of 4 patients would be cured, on average.

r | P(X = r)
0 | 0.008
1 | 0.076
2 | 0.265
3 | 0.411
4 | 0.240
Σ | 1.000

For the case of two coin tosses, if X = number of heads, then E[X] = 0(1/4) + 1(1/2) + 2(1/4) = 1. If the probability of a head on a coin toss were 3/4 (a binomial random variable with n = 2, p = 3/4):

$$P(X = k) = \begin{cases} (1/4)^2 = 1/16 & k = 0 \\ 2\cdot(1/4)(3/4) & k = 1 \\ (3/4)^2 & k = 2 \end{cases}$$

$$E[X] = 0\cdot(1/4)^2 + 1\cdot 2(1/4)(3/4) + 2\cdot(3/4)^2 = 24/16 = 1.5$$

Example A.2 (Expectations can be compared.) Consider a game where 2 questions are to be answered. You have to decide which question to answer first. Question 1 may be correctly answered with probability 0.8, in which case you win Rs. 100. Question 2 may be correctly answered with probability 0.5, in which case you win Rs. 200. If the first question is incorrectly answered, the game ends. What should you do to maximize your expected prize money?

Soln:

Notice a tradeoff: asking the more valuable question first runs the risk of not getting to answer the other question. Let the total prize money X be the random variable. If Q1 is asked first, the PMF of X is p_X(0) = 0.2, p_X(100) = 0.8 × 0.5, p_X(300) = 0.8 × 0.5. Therefore E[X] = 0 × 0.2 + 100 × 0.8 × 0.5 + 300 × 0.8 × 0.5 = Rs. 160.

If Q2 is asked first, the PMF of X is p_X(0) = 0.5, p_X(200) = 0.5 × 0.2, p_X(300) = 0.5 × 0.8, hence E[X] = 0 × 0.5 + 200 × 0.5 × 0.2 + 300 × 0.5 × 0.8 = Rs. 140. Hence ask question 1 first.
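The two expectations in Example A.2 can be checked by enumerating outcomes; a minimal sketch:

```python
def expected_prize(p_first, prize_first, p_second, prize_second):
    """E[prize] when the game ends on a wrong first answer."""
    win_both = p_first * p_second * (prize_first + prize_second)
    win_first_only = p_first * (1 - p_second) * prize_first
    return win_both + win_first_only

print(expected_prize(0.8, 100, 0.5, 200))   # Q1 first: 160.0
print(expected_prize(0.5, 200, 0.8, 100))   # Q2 first: 140.0
```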

Example A.3 (Expectation comparisons can be misleading!) The weather is good with probability 0.6. You walk 2 km to class at a speed V = 5 kph, or you drive 2 km to class at a speed V = 30 kph. If the weather is good, you walk. What is the mean of the time T to get to class? What is the mean of the velocity V?

Soln:

PMF of T: P_T(t) = 0.6 if t = 2/5 hr, 0.4 if t = 2/30 hr.

Therefore E[T] = 0.6 × 2/5 + 0.4 × 2/30 = 4/15 hr, and E[V] = 0.6 × 5 + 0.4 × 30 = 15 kph. Note that T = 2/V, but E[T] = E[2/V] ≠ 2/E[V]: E[T] = 4/15, while 2 km/E[V] = 2/15.

Variance of a discrete random variable

The variance of a discrete random variable is defined as

$$Var[X] = \sigma_X^2 = \sum_{i=1}^{R}(x_i - \mu)^2 P(X = x_i) \tag{A.7}$$

The standard deviation $SD(X) = \sqrt{Var[X]} = \sigma$ and has the same units as the mean (and X). Using the E[·] notation,

$$Var[X] = \sigma_X^2 = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2 = E[X^2] - \mu^2 \tag{A.8}$$


Moments

$\mu_k'$ denotes the kth moment about the origin:

$$\mu_k' = \sum_i x_i^k P_X(x_i) \tag{A.9}$$

Hence the mean $\mu = E[X] = \sum_i x_i P(x_i)$ = 1st moment about the origin. $\mu_k$ is the kth moment about the mean:

$$\mu_k = \sum_i (x_i - \mu)^k P_X(x_i) \tag{A.10}$$

Variance = 2nd moment about the mean = $\mu_2$, and $\sigma^2 = \mu_2' - \mu^2$. The skewness (degree of asymmetry) is $\mu_3/\sigma^3$:

$$\frac{\mu_3}{\sigma^3} = \frac{E\left[(X - E[X])^3\right]}{\sigma^3} \tag{A.11}$$

Skew to the right (+ve skew) or left (−ve skew). The mean is on the same side of the mode as the longer tail. An alternate measure of skew is (mean − mode)/σ; to avoid using the mode, use skewness ≈ 3(mean − median)/σ.

Kurtosis (peakedness) is

$$\frac{\mu_4}{\sigma^4} = \frac{E\left[(X - E[X])^4\right]}{\sigma^4} \tag{A.12}$$

A moment generating function of X is defined as

$$M(t) = M_X(t) = E\left[e^{tX}\right] = E\left[1 + tX + \frac{t^2X^2}{2!} + \ldots\right] \tag{A.13}$$

(If t is replaced with −s you have the Laplace transform, and if replaced with it, the Fourier transform. It's a small world...)

Then $\mu_n' = E[X^n]$ is the nth moment of X and is equal to $M^{(n)}(0)$, the nth derivative of M evaluated at t = 0 (from a comparison with a Taylor expansion of M(t)):

$$M(t) = 1 + \mu_1' t + \frac{\mu_2' t^2}{2!} + \ldots + \frac{\mu_n' t^n}{n!} + \ldots$$

If $y = \sum_{i=1}^{n} x_i$ with the $x_i$ independent (like a binomial — several coin tosses — from several i.i.d. Bernoullis), then

$$M_y(t) = E\left[e^{ty}\right] = E\left[e^{tx_1}e^{tx_2}\cdots e^{tx_n}\right] = E\left[e^{tx_1}\right]E\left[e^{tx_2}\right]\cdots E\left[e^{tx_n}\right] = \prod_{i=1}^{n} M_{x_i}(t) \tag{A.14}$$

Chebyshev's theorem

If a probability distribution has mean μ and standard deviation σ, the probability of getting a value deviating from μ by at least kσ is at most 1/k². We can talk in terms of kσ, for example 2σ, 3σ etc.

$$P(|X - E[X]| \ge k\sigma) \le \frac{1}{k^2} \tag{A.15}$$

This holds for any probability distribution!

Fig. A.6: Proof of Chebyshev's theorem (regions R1, R2, R3).

Proof:

$$\sigma_X^2 = \sum_{i=1}^{R}(x_i - \mu)^2 P(X = x_i)$$

Consider three regions R1, R2 and R3 (see Fig. A.6), where R1: x ≤ μ − kσ, R2: μ − kσ < x < μ + kσ, and R3: x ≥ μ + kσ. Dropping the (nonnegative) contribution of R2, and noting that (xᵢ − μ)² ≥ k²σ² in R1 and R3,

$$\sigma^2 \ge k^2\sigma^2\left[\sum_{R1} P(x_i) + \sum_{R3} P(x_i)\right] = k^2\sigma^2\,P(|X - \mu| \ge k\sigma) \Rightarrow P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

Mean and variance of the geometric random variable: let X = number of trials up to and including the first success, with success probability p per trial, so that P(X = k) = p(1 − p)^(k−1). Condition on the outcome of the first trial: the events {X = 1} and {X > 1}. If X = 1, E[X|X = 1] = 1 (we have success on the first try). If the first try fails (X > 1), one try is wasted, but we are back where we started: the expected number of remaining tries is still E[X], and therefore E[X|X > 1] = 1 + E[X].

$$E[X] = P(X = 1)E[X|X = 1] + P(X > 1)E[X|X > 1] = p + (1 - p)(1 + E[X]) \Rightarrow E[X] = \frac{1}{p} \text{ (on rearranging)}$$

Similarly $E[X^2|X = 1] = 1$, which implies that

$$E[X^2|X > 1] = E\left[(1 + X)^2\right] = 1 + 2E[X] + E[X^2]$$

Therefore,

$$E[X^2] = p \cdot 1 + (1 - p)\left(1 + 2E[X] + E[X^2]\right) = \frac{1 + 2(1 - p)E[X]}{p} = \frac{2}{p^2} - \frac{1}{p}$$

$$Var[X] = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} = \frac{1 - p}{p^2}$$
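A quick simulation check of these conditional-expectation results (NumPy assumed; the value of p is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
samples = rng.geometric(p, size=200_000)   # trials up to first success

print(samples.mean(), 1 / p)               # ~3.33 vs 1/p
print(samples.var(), (1 - p) / p**2)       # ~7.78 vs (1-p)/p^2
```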

Multiple random variables: Joint PMFs

Given two discrete random variables X and Y in the same experiment, the joint PMF of X and Y is $P_{X,Y}(X = x, Y = y)$ for all pairs (x, y) that X and Y can take.

$$P_X(x) = \sum_y P_{X,Y}(x, y) \qquad P_Y(y) = \sum_x P_{X,Y}(x, y) \tag{A.25}$$


$P_X(x)$ and $P_Y(y)$ are called marginal PMFs. If Z = g(X, Y), then

$$P_Z(z) = \sum_{\{(x,y)\,|\,g(x,y)=z\}} P_{X,Y}(x, y) \tag{A.26}$$

$$E[g(X, Y)] = \sum_{x,y} g(x, y)P_{X,Y}(x, y) \tag{A.27}$$

For example, E[aX + bY + c] = aE[X] + bE[Y] + c. For three variables:

$$P_{X,Y}(x, y) = \sum_z P_{X,Y,Z}(x, y, z) \tag{A.28}$$

$$P_X(x) = \sum_y \sum_z P_{X,Y,Z}(x, y, z) \tag{A.29}$$

$$E[g(X, Y, Z)] = \sum_{x,y,z} g(x, y, z)P_{X,Y,Z}(x, y, z) \tag{A.30}$$

For n random variables X1, X2, ..., Xn and scalars a1, a2, ..., an (we saw this before: a binomial is constructed from several Bernoulli variables),

$$E[a_1X_1 + a_2X_2 + \ldots + a_nX_n] = a_1E[X_1] + a_2E[X_2] + \ldots + a_nE[X_n]$$

For example, E[aX + bY + cZ + d] = aE[X] + bE[Y] + cE[Z] + d.

Conditional PMFs

Given that event A has occurred, we can use conditional probability: this implies a new sample space where A is known to have happened. The conditional PMF of X, given that event A occurred with P(A) > 0, is

$$P_{X|A}(x) = P(X = x|A) = \frac{P(\{X = x\} \cap A)}{P(A)} \tag{A.31}$$

$$\sum_x P(\{X = x\} \cap A) = P(A) \tag{A.32}$$

Events $\{X = x\} \cap A$ are disjoint for different values of x, and hence $\sum_x P_{X|A}(x) = 1$: $P_{X|A}$ is a PMF.

Example A.5 X = roll of a die and A = event that the roll is an even number.

$$P_{X|A}(x) = P(X = x\,|\,\text{roll is even}) = \frac{P(X = x \text{ and } X \text{ is even})}{P(\text{roll is even})} = \begin{cases} 1/3 & x = 2, 4, 6 \\ 0 & \text{otherwise} \end{cases}$$

The conditional PMF of X given random variable Y = y is used when one random variable is conditioned on another:

$$P_{X|Y}(x|y) = P(X = x|Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P_{X,Y}(x, y)}{P_Y(y)}$$

Note that $\sum_x P_{X|Y}(x|y) = 1$ (normalization). Since $P_{X,Y}(x, y) = P_Y(y)P_{X|Y}(x|y) = P_X(x)P_{Y|X}(y|x)$, we can compute joint PMFs from conditionals using a sequential approach.

Example A.6 Consider four rolls of a six-sided die. X = number of 1s and Y = number of 2s obtained. What is $P_{X,Y}(x, y)$?

Soln:

The marginal PMF $P_Y$ is a binomial:

$$P_Y(y) = {}^4C_y\,(1/6)^y(5/6)^{4-y}, \quad y = 0, 1, 2, 3, 4$$

To get the conditional PMF $P_{X|Y}(x|y)$: given that Y = y, X = number of 1s in the 4 − y remaining rolls, in which the values 1, 3, 4, 5, 6 occur with equal probability. Therefore

$$P_{X|Y}(x|y) = {}^{4-y}C_x\,(1/5)^x(4/5)^{4-y-x} \quad \text{for all } x, y = 0, 1, 2, 3, 4 \text{ with } 0 \le x + y \le 4$$

For x, y such that 0 ≤ x + y ≤ 4,

$$P_{X,Y}(x, y) = P_Y(y)P_{X|Y}(x|y) = {}^4C_y\,(1/6)^y(5/6)^{4-y}\cdot{}^{4-y}C_x\,(1/5)^x(4/5)^{4-y-x}$$

For other (x, y), $P_{X,Y}(x, y) = 0$.

Marginal PMFs can therefore be computed from conditional PMFs (this is the total probability theorem in a different notation):

$$P_X(x) = \sum_y P_{X,Y}(x, y) = \sum_y P_Y(y)P_{X|Y}(x|y)$$

Independence

Random variable X is independent of event A if $P(X = x \text{ and } A) = P(X = x)P(A) = P_X(x)P(A)$ for all x. As long as P(A) > 0, since $P_{X|A}(x) = P(X = x \text{ and } A)/P(A)$, this implies that $P_{X|A}(x) = P_X(x)$ for all x.

Random variables X and Y are independent if (the value of Y (= y) has no bearing on the value of X (= x))

$$P_{X,Y}(x, y) = P_X(x)P_Y(y) \tag{A.33}$$

for all x, y; equivalently, $P_{X|Y}(x|y) = P_X(x)$ for all y with $P_Y(y) > 0$, and all x.

Random variables X and Y are conditionally independent given an event A with P(A) > 0 if

$$P(X = x, Y = y|A) = P(X = x|A)P(Y = y|A) \tag{A.34}$$

for all x and y; equivalently, $P_{X|Y,A}(x|y) = P_{X|A}(x)$ for all x and y such that $P_{Y|A}(y) > 0$.

For independent variables (independence implies E[XY] = E[X]E[Y]):

$$E[XY] = \sum_x \sum_y xy\,P_{X,Y}(x, y) = \sum_x \sum_y xy\,P_X(x)P_Y(y) = \left[\sum_x xP_X(x)\right]\left[\sum_y yP_Y(y)\right] = E[X]\,E[Y] \tag{A.35}$$

Similarly,

$$E[g(X)h(Y)] = E[g(X)]\,E[h(Y)] \tag{A.36}$$

Expectations of combined variables always add up:

$$E[X + Y] = E[X] + E[Y] \tag{A.37}$$

Variances add up only if X and Y are independent. For two independent random variables X and Y, with Z = X + Y:

$$Var[Z] = E\left[(X + Y - E[X + Y])^2\right] = E\left[\left((X - E[X]) + (Y - E[Y])\right)^2\right]$$
$$= E\left[(X - E[X])^2\right] + E\left[(Y - E[Y])^2\right] + 2E[(X - E[X])(Y - E[Y])]$$
$$= E\left[(X - E[X])^2\right] + E\left[(Y - E[Y])^2\right] = Var[X] + Var[Y] \tag{A.38}$$

(the cross term vanishes since, by independence, it factors into E[X − E[X]]·E[Y − E[Y]] = 0).


Independence for several random variables

Independence ⇒ $P_{X,Y,Z}(x, y, z) = P_X(x)P_Y(y)P_Z(z)$ for all x, y, z, etc. Obviously g(X, Y) and h(Z) are then independent. For n independent variables X1, X2, ..., Xn,

$$Var[X_1 + X_2 + \ldots + X_n] = Var[X_1] + Var[X_2] + \ldots + Var[X_n] \tag{A.39}$$

Example A.7 Variance of the binomial: given n independent coin tosses, with p = probability of heads, for each i let Xi = Bernoulli random variable (= 1 if heads, = 0 if tails) and X = X1 + X2 + ... + Xn. Then X is a binomial random variable. The coin tosses are independent ⇒ X1, X2, ..., Xn are independent variables and

$$Var[X] = \sum_{i=1}^{n} Var[X_i] = np(1-p)$$

Covariance and Correlation

The covariance of random variables X and Y is

$$cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y] \tag{A.40}$$

When cov[X, Y] = 0, X and Y are uncorrelated. If X and Y are independent, they are also uncorrelated:

$$E[XY] = E[X]\,E[Y] \Rightarrow cov[X, Y] = 0$$

The reverse is not true! Beware: uncorrelated variables need not be independent.

Example A.8 Let (X, Y) take values (1, 0), (0, 1), (−1, 0), (0, −1), each with probability 1/4, i.e. the marginal probabilities of X and Y are symmetric around 0. So E[X] = E[Y] = 0. For all possible values of (x, y), either x or y = 0 ⇒ XY = 0 ⇒ E[XY] = 0 ⇒ cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] = 0 ⇒ X and Y are uncorrelated. However, X and Y are not independent: a nonzero X implies Y = 0, etc.

Correlation Coefficient of X and Y

Let X and Y be random variables with nonzero variances. The correlation coefficient is a normalized version of covariance (note that −1 ≤ ρ_XY ≤ 1):

$$\rho_{XY} = \frac{cov[X, Y]}{\sqrt{Var[X]\,Var[Y]}} \tag{A.41}$$

Example A.9 Consider n independent tosses of a biased coin (P(head) = p). Let X = # of heads and Y = # of tails. Are X, Y correlated?

Soln:

For all (x, y), x + y = n ⇒ E[X] + E[Y] = n ⇒ X − E[X] = −(Y − E[Y]), so

$$cov[X, Y] = E[(X - E[X])(Y - E[Y])] = -E\left[(X - E[X])^2\right] = -Var[X]$$

Note that Var[Y] = Var[n − X] = Var[−X] = Var[X].

$$\rho(X, Y) = \frac{cov[X, Y]}{\sqrt{Var[X]\,Var[Y]}} = \frac{-Var[X]}{\sqrt{Var[X]\,Var[X]}} = -1$$


Correlation

For the sum of several, not necessarily independent variables (the general case):

$$Var\left[\sum_{i=1}^{n} c_iX_i\right] = \sum_{i=1}^{n} c_i^2\,Var[X_i] + 2\sum_{i<j} c_ic_j\,cov[X_i, X_j] \tag{A.42}$$

$$Var[c_1X_1 + c_2X_2] = c_1^2Var[X_1] + c_2^2Var[X_2] + 2c_1c_2\,cov[X_1, X_2]$$
$$= c_1^2Var[X_1] + c_2^2Var[X_2] + 2c_1c_2\sqrt{Var[X_1]\,Var[X_2]}\,\rho(X_1, X_2)$$

A.3 Continuous probability distributions

Continuous vs discrete: continuous distributions are more fine grained, may be more accurate, and permit analysis by calculus. A random variable X is continuous if its probability distribution can be defined by a non-negative function $f_X$ such that

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx \tag{A.43}$$

i.e. P(X is between a and b) = area under the curve from a to b. The total area under the curve = 1. $f_X(x)$ is a probability density function (PDF). Intuitively, the PDF of X is large where probability is high and small where it is low. For a single value of X, $P(X = a) = \int_a^a f_X(x)\,dx = 0$, so including endpoints does not matter:

$$P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b)$$

For $f_X$ to be a valid PDF, it must satisfy ($f_X(x)$ is not the probability of an event: it can even be greater than one):

1. $f_X(x) \ge 0$ for all x,
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = P(-\infty < X < \infty) = 1$.

$P([x, x + \delta x]) = \int_x^{x+\delta x} f_X(t)\,dt \approx f_X(x)\,\delta x$, where $f_X(x)$ = probability mass per unit length near x.

Expectation and Moments for continuous RVs

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx \tag{A.44}$$

The nth moment is $E[X^n]$. In general $E[g(X)] = \int g(x)f_X(x)\,dx$; g(x) may be continuous or discrete. The variance is

$$Var[X] = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2 = \int_{-\infty}^{\infty}(x - E[X])^2 f_X(x)\,dx > 0$$

If Y = aX + b, then E[Y] = aE[X] + b and Var[Y] = a²Var[X].

(Discrete uniform law: the sample space had a finite number of equally likely outcomes. For discrete variables, we count the number of outcomes associated with an event. For continuous variables, we compute the length of a subset of the real line.)

Continuous uniform random variable

The PDF of a continuous uniform random variable is

$$f_X(x) = \begin{cases} c & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \tag{A.45}$$

where c must be > 0. For f to be a PDF, $1 = \int_a^b c\,dz = c\int_a^b dz = c(b-a) \Rightarrow c = 1/(b-a)$.


Piecewise constant PDF, general form:

$$f_X(x) = \begin{cases} c_i & a_i \le x < a_{i+1},\ i = 1, 2, \ldots, n-1 \\ 0 & \text{otherwise} \end{cases}$$

$$1 = \int_{a_1}^{a_n} f_X(x)\,dx = \sum_{i=1}^{n-1}\int_{a_i}^{a_{i+1}} c_i\,dx = \sum_{i=1}^{n-1} c_i(a_{i+1} - a_i)$$

For example, the following (unbounded) density is a valid PDF:

$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{x}} & 0 < x \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad \int_0^1 \frac{1}{2\sqrt{x}}\,dx = \left.\sqrt{x}\right|_0^1 = 1$$

Fig. A.11: The piecewise constant distribution.

Mean and variance of the uniform random variable:

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a}\left.\frac{x^2}{2}\right|_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}$$

The PDF is symmetric around (a + b)/2.

$$E[X^2] = \int_a^b \frac{x^2}{b-a}\,dx = \frac{1}{b-a}\left.\frac{x^3}{3}\right|_a^b = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}$$

$$Var[X] = E[X^2] - (E[X])^2 = \frac{a^2 + ab + b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}$$

Example A.10 Let [a, b] = [0, 1] (X uniform) and let

$$g(x) = \begin{cases} 1 & x \le 1/3 \\ 2 & x > 1/3 \end{cases}$$

1. Discrete: Y = g(X) has PMF

$$P_Y(1) = P(X \le 1/3) = 1/3 \qquad P_Y(2) = P(X > 1/3) = 2/3$$

$$E[Y] = \frac{1}{3}\cdot 1 + \frac{2}{3}\cdot 2 = \frac{5}{3}$$

2. Continuous:

$$E[Y] = \int_0^1 g(x)f_X(x)\,dx = \int_0^{1/3} 1\,dx + \int_{1/3}^{1} 2\,dx = \frac{1}{3} + \frac{4}{3} = \frac{5}{3}$$

Exponential continuous distribution

The PDF of X is

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{A.46}$$

Is it a PDF?

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = \left.-e^{-\lambda x}\right|_0^{\infty} = 1$$

For any a ≥ 0,

$$P(X \ge a) = \int_a^{\infty} \lambda e^{-\lambda x}\,dx = \left.-e^{-\lambda x}\right|_a^{\infty} = e^{-\lambda a}$$

Fig. A.12: The exponential distribution with λ = 0.05.

Examples: time till a light bulb burns out; time till an accident.


Mean

$$E[X] = \int_0^{\infty} x(\lambda e^{-\lambda x})\,dx = \left.(-xe^{-\lambda x})\right|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 - \left.\frac{e^{-\lambda x}}{\lambda}\right|_0^{\infty} = \frac{1}{\lambda}$$

Variance

$$E[X^2] = \int_0^{\infty} x^2(\lambda e^{-\lambda x})\,dx = \left.(-x^2e^{-\lambda x})\right|_0^{\infty} + \int_0^{\infty} 2xe^{-\lambda x}\,dx = 0 + \frac{2}{\lambda}E[X] = \frac{2}{\lambda^2}$$

$$Var[X] = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$

Example A.11 The sky is falling... Meteorites land in a certain area at an average of one every 10 days. What is the probability that the first meteorite lands between 6 AM and 6 PM of the first day, given that it is now midnight?

Soln:

Let X = elapsed time till the strike (in days), an exponential variable with mean = 1/λ = 10 days ⇒ λ = 1/10. 6 AM and 6 PM of the first day correspond to X = 1/4 and X = 3/4:

$$P\left(\frac{1}{4} \le X \le \frac{3}{4}\right) = P\left(X \ge \frac{1}{4}\right) - P\left(X \ge \frac{3}{4}\right) = e^{-\lambda/4} - e^{-3\lambda/4} = e^{-1/40} - e^{-3/40} = 0.047$$

Probability that the meteorite lands between 6 AM and 6 PM of some day: for the kth day, this time frame is (k − 3/4) ≤ X ≤ (k − 1/4), so the probability is

$$\sum_{k=1}^{\infty} P\left(k - \frac{3}{4} \le X \le k - \frac{1}{4}\right) = \sum_{k=1}^{\infty}\left[P\left(X \ge k - \frac{3}{4}\right) - P\left(X \ge k - \frac{1}{4}\right)\right] = \sum_{k=1}^{\infty}\left(e^{-(4k-3)/40} - e^{-(4k-1)/40}\right)$$
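A minimal numerical check of Example A.11 (pure Python; the infinite sum is truncated):

```python
import math

lam = 1 / 10                                      # rate, per day

def p_ge(a):                                      # P(X >= a), exponential
    return math.exp(-lam * a)

print(p_ge(0.25) - p_ge(0.75))                    # first day: ~0.047

total = sum(p_ge(k - 0.75) - p_ge(k - 0.25) for k in range(1, 200))
print(total)                                      # any day, summed over k
```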

Cumulative distribution function (CDF)

The CDF of X is

$$F_X(x) = P(X \le x) = \begin{cases} \sum_{k \le x} P_X(k) & X \text{ discrete} \\ \int_{-\infty}^{x} f_X(t)\,dt & X \text{ continuous} \end{cases} \tag{A.47}$$

For the discrete form, the CDF will look like a staircase.

Some properties of CDFs $F_X(x)$:

$F_X(x)$ is monotonically nondecreasing: if x ≤ y, then $F_X(x) \le F_X(y)$. $F_X(x) \to 0$ as $x \to -\infty$, and $F_X(x) \to 1$ as $x \to +\infty$. If X is discrete, $F_X$ is piecewise constant (a staircase); if X is continuous, $F_X$ is continuous.

If X is discrete, CDF from PMF:

$$F_X(k) = \sum_{i=-\infty}^{k} P_X(i)$$

PMF from CDF (for all k):

$$P_X(k) = P(X \le k) - P(X \le k-1) = F_X(k) - F_X(k-1)$$


If X is continuous, CDF from PDF:

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

PDF from CDF:

$$f_X(x) = \frac{dF_X}{dx}$$

Geometric vs. Exponential CDFs

Let X = geometric random variable with parameter p, e.g. X = number of trials up to and including the first success, with probability of success p per trial:

$$P(X = k) = p(1-p)^{k-1}, \quad k = 1, 2, \ldots$$

$$F_X^{geo}(n) = \sum_{k=1}^{n} p(1-p)^{k-1} = p\,\frac{1 - (1-p)^n}{1 - (1-p)} = 1 - (1-p)^n, \quad n = 1, 2, \ldots$$

Let Y = exponential random variable, λ > 0:

$$F_Y^{exp}(x) = \int_0^x \lambda e^{-\lambda t}\,dt = \left.-e^{-\lambda t}\right|_0^x = 1 - e^{-\lambda x}, \quad x > 0$$

Fig. A.13: Geometric vs. exponential distributions (p = 0.15 and λ = 0.185).

Comparison: let $\lambda = -\ln(1-p)$, i.e. $e^{-\lambda} = 1 - p$. The exponential and geometric CDFs then match at all x = n, n = 1, 2, ...:

$$F^{exp}(n) = F^{geo}(n), \quad n = 1, 2, \ldots$$
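The CDF match can be verified numerically; a minimal sketch with p = 0.15 and λ = −ln(1 − p):

```python
import math

p = 0.15
lam = -math.log(1 - p)

for n in range(1, 6):
    f_geo = 1 - (1 - p) ** n
    f_exp = 1 - math.exp(-lam * n)
    print(n, f_geo, f_exp)        # identical at every integer n
```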

Normal (Gaussian) random variables

The most useful distribution: we use it to approximate other distributions.

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty \tag{A.48}$$

for some parameters μ and σ (σ > 0).

Normalization confirms that it is a PDF:

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-(x-\mu)^2/2\sigma^2}\,dx = 1$$

Fig. A.14: Normal distribution.

E[X] = mean = μ (from symmetry arguments alone).

$$Var[X] = \int_{-\infty}^{\infty}(x - E[X])^2 f_X(x)\,dx = \int_{-\infty}^{\infty}(x-\mu)^2\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}\,dx$$

Let z = (x − μ)/σ:

$$Var[X] = \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2e^{-z^2/2}\,dz = \frac{\sigma^2}{\sqrt{2\pi}}\left.(-ze^{-z^2/2})\right|_{-\infty}^{+\infty} + \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \sigma^2$$

X = normal variable with (μ, σ²) = (mean, var). The peak height of N(μ, σ) is $1/\sqrt{2\pi}\sigma$, hence height ∝ 1/σ. If Y = aX + b, then E[Y] = aE[X] + b and Var[Y] = a²σ².


Standard Normal Distribution

A normal distribution Y with mean μ = 0 and σ² = 1: N(μ, σ²) = N(0, 1) = the standard normal. The CDF of Y is

$$\Phi(y) = P(Y \le y) = P(Y < y) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-t^2/2}\,dt \tag{A.49}$$

Φ(y) is available as a table. If the table only gives the value of Φ(y) for y ≥ 0, use symmetry.

Example A.12 Φ(−0.5) = P(Y ≤ −0.5) = P(Y ≥ 0.5) = 1 − P(Y ≤ 0.5) = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085.

Given a normal random variable X with mean μ and variance σ², use

$$Z = \frac{X - \mu}{\sigma}$$

Then

$$E[Z] = \frac{E[X] - \mu}{\sigma} = 0 \qquad Var[Z] = \frac{Var[X]}{\sigma^2} = 1$$

so Z is standard normal.

Example A.13 Annual rainfall at a spot follows a normal distribution with mean μ = 60 mm/yr and σ = 20 mm/yr. What is the probability that we get at least 80 mm/yr?

Soln: Let X = rainfall/yr. Then Z = (X − μ)/σ = (X − 60)/20. So

$$P(X \ge 80) = P\left(Z \ge \frac{80 - 60}{20}\right) = P(Z \ge 1) = 1 - \Phi(1) = 1 - 0.8413 = 0.1587$$

In general,

$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$

where Z is standard normal. The 100·uth percentile of N(0, 1) is $z_u$, with $P(X < z_u) = u$ where X ∼ N(0, 1).
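Example A.13 in code, using the standard normal CDF (SciPy assumed; math.erf would also do):

```python
from scipy.stats import norm

mu, sigma = 60.0, 20.0
print(1 - norm.cdf(80, loc=mu, scale=sigma))    # 0.1587
print(1 - norm.cdf((80 - mu) / sigma))          # same, via standardization
```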

Approximating a binomial with a normal

Binomial: probability of k successes in n trials, each trial having success probability p. The binomial PMF is painful to handle for large n, so we approximate. E[X] = np and Var[X] = npq, so the equivalent normal distribution is N(np, npq).

Consider a binomial with mean np and variance npq, with a range of interest a ≤ X ≤ b. It may be approximated with a normal N(np, npq) from a − 1/2 ≤ X ≤ b + 1/2. For the special case a = b, P(X = a) = area under the normal curve from a − 1/2 to a + 1/2. At the extremes:

Binomial  | Normal
P(X = 0)  | area to the left of 1/2
P(X = n)  | area to the right of n − 1/2

Use when npq ≥ 5, with n, p, q all moderate.

Approximating the Poisson with a normal

The Poisson is difficult to use for large λ. $P_X(X = x) \approx N(\lambda, \lambda)$, since for the Poisson, mean = var = λ. For the normal, we take the area from x − 1/2 to x + 1/2, or the area to the left of 1/2 for x = 0. Use this approximation for λ ≥ 10.

Summary of approximations

1. Approximating a Binomial with a Normal. Rule of thumb: npq ≥ 5, with n, p, q all moderately valued.
2. Approximating a Poisson with a Normal: λ ≥ 10.
3. Approximating a Binomial with a Poisson: n ≥ 100, p ≤ 0.01. (Obviously, both Poisson and Normal may be used for n ≥ 1000, p = 0.01.)


Beta distribution

The Beta distribution has as its pdf (unlike the binomial, a and b need not be integers):

$$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,x^{a-1}(1 - x)^{b-1}, \quad 0 \le x \le 1 \tag{A.50}$$

Γ(a + 1) = aΓ(a) in general, and when a is an integer, using this relationship recursively gives us Γ(a + 1) = a!. When a = b = 1, Beta(x; 1, 1) = the uniform distribution. It is important to appreciate that by varying a and b, the pdf can attain various shapes. This is particularly useful when we look for a pdf of an appropriate shape.

$$E[X] = \int_0^1 x f(x)\,dx = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1 x^a(1-x)^{b-1}\,dx$$
$$= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\int_0^1 \frac{\Gamma(a+b+1)}{\Gamma(a+1)\Gamma(b)}\,x^a(1-x)^{b-1}\,dx = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)} = \frac{a}{a+b}$$

(the remaining integral is that of a Beta(a + 1, b) density and equals 1). Similarly,

$$E[X^2] = \frac{a(a+1)}{(a+b+1)(a+b)} \qquad Var[X] = \frac{ab}{(a+b)^2(a+b+1)}$$
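A quick check of the Beta moments (SciPy assumed; a and b need not be integers):

```python
from scipy.stats import beta

a, b = 2.5, 4.0
mean, var = beta.stats(a, b, moments='mv')
print(mean, a / (a + b))                            # a/(a+b)
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))    # ab/((a+b)^2 (a+b+1))
```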


Appendix B: Estimation and Hypothesis Testing

    B.1 Samples and statistics

"There's no sense in being precise when you don't even know what you're talking about." — John von Neumann

We sample to draw some conclusions about an entire population. Examples: exit polls, census.

We assume that the population can be described by a probability distribution. We assume that the samples give us measurements following this distribution. We assume that the samples are randomly chosen; it is very difficult to verify that a sample is really a random sample. The population is assumed to be (effectively) infinite; sampling theory must be used if the population is actually finite.

Simple random sample: each member of a random sample is chosen independently and has the same (nonzero) probability of being chosen.

Cluster sampling: if the entire population cannot be uniformly sampled, divide it into clusters and sample some clusters.

Clinical Trials and Randomization

Block randomization: one block gets the treatment and another the placebo. Blind trials: patients do not know what they get. Double blind trials: both doctors and patients do not know what they get.

Distributions and Estimation

We wish to infer the parameters of a population distribution: mean, variance, proportion etc. We use a statistic derived from the sample data: sample mean, sample variance etc. A statistic is a random variable. What distribution does it follow? What are its parameters?

Notation

Population parameters are typically in Greek: μ = population mean, σ² = population variance. The corresponding statistic is not in Greek: x̄ = sample mean, s² = sample variance. We can also denote an estimate of the parameter θ as θ̂.

Regression: you want y = αx + β; you get y = ax + b.


Why do we infer using the sample mean?

Consider a population with mean μ and variance σ². We have n measurements x1, x2, ..., xn.

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} \tag{B.1}$$

It's all about expectations:

$$E[x_i] = \mu \qquad Var[x_i] = \sigma^2 \tag{B.2}$$

$$E[\bar{x}] = E\left[\frac{x_1 + \cdots + x_n}{n}\right] = \frac{E[x_1] + \cdots + E[x_n]}{n} = \frac{n\mu}{n} = \mu \tag{B.3}$$

$$E[x_i] = E[\bar{x}] = \mu \tag{B.4}$$

x̄ is an unbiased estimator of μ. For a statistic θ̂ to be unbiased,

$$E[\hat{\theta}] = \theta \tag{B.5}$$

The population mean has other unbiased estimators such as the sample median, (min + max)/2, and the trimmed mean. x̄ is not a robust estimator, as it is sensitive to outliers. Nevertheless, x̄ is the preferred estimator of μ: it is the estimator with the smallest variance.

What is the variance of x̄?

We had Var[xᵢ] = σ².

$$Var[\bar{x}] = Var\left[\frac{x_1 + \cdots + x_n}{n}\right] = \frac{1}{n^2}Var[x_1 + \cdots + x_n] = \frac{1}{n^2}\left(Var[x_1] + \cdots + Var[x_n]\right) \text{ (independence!)} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \tag{B.6}$$

xᵢ or x̄?

$$E[x_i] = \mu \qquad Var[x_i] = \sigma^2 \tag{B.7}$$

$$E[\bar{x}] = \mu \qquad Var[\bar{x}] = \frac{\sigma^2}{n} \tag{B.8}$$

If x ∼ N(0, 1), then x̄ ∼ N(0, 1/n). The spread of the sample mean shrinks with n. Therefore we must sample more and gain accuracy!

Fig. B.1: The distribution of x̄. (Intuitively, we desire replication of results.)
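A minimal simulation of (B.6): the variance of the sample mean shrinks as σ²/n (NumPy assumed; the population here is a hypothetical uniform one, deliberately non-normal):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1 / 12                                 # variance of Uniform(0, 1)

for n in (5, 20, 80):
    xbars = rng.uniform(0, 1, size=(50_000, n)).mean(axis=1)
    print(n, xbars.var(), sigma2 / n)           # empirical vs sigma^2/n
```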

The population size can matter

For an infinite population, Var[x̄] = σ²/n. For a finite population of size N, we need a correction factor (more on this in a separate handout):

$$Var[\bar{x}] = \frac{\sigma^2}{n}\cdot\frac{N - n}{N - 1} \tag{B.9}$$


The Central Limit Theorem

Theorem B.1 The sum of a large number of independent random variables has a distribution that is approximately normal.

Let x1, ..., xn be i.i.d. with mean μ and variance σ². For large n,

$$(x_1 + \cdots + x_n) \sim N(n\mu, n\sigma^2) \tag{B.10}$$

$$P\left(\frac{(x_1 + \cdots + x_n) - n\mu}{\sigma\sqrt{n}} < x\right) \to P(z < x) \tag{B.11}$$

Consider the total obtained on rolling several (n) dice (Fig. B.2). As n increases, the distribution approaches a Gaussian curve!

Fig. B.2: Rolling n dice.
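The dice experiment of Fig. B.2 as a sketch (NumPy assumed): the standardized total of n dice approaches N(0, 1) as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, var = 3.5, 35 / 12                          # mean and variance of one die

for n in (1, 2, 10, 50):
    totals = rng.integers(1, 7, size=(100_000, n)).sum(axis=1)
    z = (totals - n * mu) / np.sqrt(n * var)
    # fraction within one standard deviation: -> ~0.683 for a Gaussian
    print(n, np.mean(np.abs(z) < 1))
```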

CLT and the Binomial

Let X = x1 + ... + xn, where xᵢ is a Bernoulli variable.

$$E[x_i] = p \qquad Var[x_i] = p(1-p) \tag{B.12}$$

$$E[X] = np \qquad Var[X] = np(1-p) \tag{B.13}$$

The CLT suggests that for large n,

$$\frac{X - np}{\sqrt{np(1-p)}} \sim N(0, 1) \tag{B.14}$$

The binomial can be approximated with a normal if np(1 − p) ≥ 10. For p = 0.5, npq ≥ 10 implies n ≥ 40! In general, the normal approximation may be applied if n ≥ 30. So keep sampling! (Remember: even if an individual measurement does not follow a normal distribution, the measured sample mean does.)

Sample variance

The population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \tag{B.15}$$

The sample variance is

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \tag{B.16}$$

We had E[x̄] = μ. But why n − 1: is E[s²] = σ²? We compute the expectation as follows:

$$(x_i - \bar{x})^2 = (x_i - \mu + \mu - \bar{x})^2 = (x_i - \mu)^2 + (\mu - \bar{x})^2 + 2(x_i - \mu)(\mu - \bar{x})$$

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 + \sum_{i=1}^{n}(\mu - \bar{x})^2 + 2\sum_{i=1}^{n}(x_i - \mu)(\mu - \bar{x}) \tag{B.17}$$

The middle term is

$$\sum_{i=1}^{n}(\mu - \bar{x})^2 = n(\mu - \bar{x})^2$$


The last term is

$$2\sum_{i=1}^{n}(x_i - \mu)(\mu - \bar{x}) = -2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) = -2n(\bar{x} - \mu)^2$$

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 + n(\mu - \bar{x})^2 - 2n(\bar{x} - \mu)^2 = \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$$

$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} - (\bar{x} - \mu)^2$$

Taking the expectation on each side,

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = E\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}\right] - E\left[(\bar{x} - \mu)^2\right] \tag{B.18}$$

But

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}\right] = \sigma^2 \tag{B.19}$$

$$E\left[(\bar{x} - \mu)^2\right] = Var[\bar{x}] = \frac{\sigma^2}{n} \tag{B.20}$$

This implies

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{(n-1)\sigma^2}{n} \tag{B.21}$$

We have a biased estimator! This means that

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right] = \sigma^2 \tag{B.22}$$

Therefore we define an unbiased estimator s² as

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \tag{B.23}$$

We lost a degree of freedom. Where? Compare:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

(μ is known exactly in the population formula; x̄ is estimated from the same data, which costs one degree of freedom.)
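A simulation of the bias result (B.21)-(B.22) (NumPy assumed; σ² and n are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n = 4.0, 5

x = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
xbar = x.mean(axis=1, keepdims=True)
ss = ((x - xbar) ** 2).sum(axis=1)

print(np.mean(ss / n), (n - 1) / n * sigma2)    # biased: ~(n-1)/n * sigma^2
print(np.mean(ss / (n - 1)), sigma2)            # unbiased: ~sigma^2
```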

B.2 Point and Interval Estimation

Point estimates

Given $x_i \sim N(\mu, \sigma^2)$, the preferred point estimate of μ is x̄. The preferred point estimate of σ² is s². We know how x̄ behaves:

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \tag{B.24}$$

We do not know how s² behaves yet. Similarly, intuition leads us to the best estimate of the binomial proportion p as p̂ = X/n. The maximum likelihood estimation approach below provides a method for the determination of parameter point estimates.


Optimal estimation approaches

The MLE of a parameter θ is that value of θ that maximizes the likelihood of seeing the observed values, i.e. P(D|θ). Note that instead we could desire to maximize P(θ|D): i.e. we choose the most probable parameter value given the data that we have already seen. This is maximum a posteriori (MAP) estimation. The MAP estimate is typically more difficult to obtain because we need to rewrite it in terms of P(D|θ) using Bayes' law, and p(θ) needs to be specified. Also note that if P(θ) is effectively a constant (i.e. θ is a uniform RV), then MLE = MAP. There are other approaches towards obtaining estimates of parameters, including the confusingly named Bayesian estimation (different from MAP), maximum entropy etc. In general, note that we want estimates with small variances, and hence minimum variance estimates are popular. In the discussion that follows, we almost always use MLE; we will however encounter MAP at least once later.

MLE of the Binomial parameter p:

Let X = random variable describing a coin flip, X = 1 if heads, X = 0 if tails. For one coin flip (as before, the Bernoulli and the Binomial):

$$f(x) = p^x(1-p)^{1-x} \tag{B.25}$$

For n coin flips,

$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} \tag{B.26}$$

The log-likelihood may be written as

$$l(p|x_1, \ldots, x_n) = \sum_{i=1}^{n}\left[x_i \ln p + (1 - x_i)\ln(1-p)\right] \tag{B.27}$$

The MLE of p is p̂, obtained from (the answer here is obvious; MLE does this quite often: it gives us intuitive answers. However, as the next example shows, MLE can be biased)

$$\frac{dl(p)}{dp} = 0 \Rightarrow \sum_i\left[\frac{x_i}{p} - \frac{1 - x_i}{1-p}\right] = 0$$

which can be rearranged to give

$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{B.28}$$

The MLE of the binomial proportion (X heads in n tosses) is

$$\hat{p} = \frac{X}{n} \quad \text{and} \quad E[\hat{p}] = p \tag{B.29}$$
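A sketch confirming that the closed-form MLE (B.28) maximizes the log-likelihood (B.27) numerically (pure Python; the coin-flip data are hypothetical):

```python
import math

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]          # hypothetical data

def loglik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in flips)

p_hat = sum(flips) / len(flips)                 # closed form: 0.7
grid = [i / 1000 for i in range(1, 1000)]
p_num = max(grid, key=loglik)                   # brute-force maximizer
print(p_hat, p_num)                             # they agree
```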

An MLE perspective on x̄ and s², for a Gaussian

Let D = {x1, ..., xn} be a dataset of i.i.d. samples from N(μ, σ²). Then the joint density function is

$$f(D) = \prod_{i=1}^{n}\frac{e^{-(x_i-\mu)^2/2\sigma^2}}{\sqrt{2\pi}\,\sigma} = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right] \tag{B.30}$$

f(D) is the likelihood of seeing the data; l = ln f(D) is the log likelihood of seeing the data. Which estimates of μ and σ² would maximize the likelihood of seeing the data? Set

$$\frac{\partial l}{\partial \mu} = 0 \qquad \frac{\partial l}{\partial \sigma^2} = 0 \tag{B.31}$$

and solve for μ and σ².

$$\frac{\partial l}{\partial \mu} = 0 \Rightarrow \frac{\sum_{i=1}^{n}(x_i - \mu)}{\sigma^2} = 0 \Rightarrow \hat{\mu} = \frac{\sum_i x_i}{n} = \bar{x} \tag{B.32}$$


The MLE of μ is μ̂ = x̄. The MLE of σ², however, is a surprise:

$$\frac{\partial l}{\partial \sigma^2} = 0 = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2$$

Decompose the sum:

$$\frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{2\sigma^4}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
$$= \frac{1}{2\sigma^4}\left[\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 + 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x})\right] = \frac{1}{2\sigma^4}\left[nS^2 + n(\bar{x} - \mu)^2\right]$$

where S² is defined as $S^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2/n$ (the cross term vanishes since $\sum_i(x_i - \bar{x}) = 0$). Thus

$$0 = -\frac{n}{2\sigma^2} + \frac{n}{2\sigma^4}\left[S^2 + (\bar{x} - \mu)^2\right]$$

But μ̂ = x̄ from before, and hence the MLE of σ² is biased:

$$\hat{\sigma}^2_{ML} = S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \tag{B.33}$$

Interval Estimation

We have point estimates of μ and σ². But remember that x̄ and s² are random variables. We could use bounds: for example, μ = x̄ ± δ, where δ should depend on σ².

Chebyshev's inequality for a random variable x

Theorem B.2

$$P(|x_i - \mu| > k\sigma) \le \frac{1}{k^2} \tag{B.34}$$

$$\sigma^2 = E\left[(x - \mu)^2\right] = \sum_{R1,R2,R3}(x_i - E[x])^2 p(x_i)$$
$$\ge \sum_{R1}(x_i - E[x])^2 p(x_i) + \sum_{R3}(x_i - E[x])^2 p(x_i)$$
$$\ge \sum_{R1} k^2\sigma^2 p(x_i) + \sum_{R3} k^2\sigma^2 p(x_i)$$
$$\Rightarrow \frac{1}{k^2} \ge \sum_{R1} p(x_i) + \sum_{R3} p(x_i) = P(|x_i - \mu| > k\sigma)$$

Fig. B.3: Chebyshev's theorem (regions beyond μ − kσ and μ + kσ).

This applies to ANY distribution. For k = 2, Chebyshev's theorem suggests P(|xᵢ − μ| > 2σ) ≤ 0.25. For a Gaussian, P(|xᵢ − μ| > 2σ) ≈ 0.05. Clearly, if the density function is known, we should be using it, instead of Chebyshev's theorem, to determine probabilities.

Fig. B.4: Intervals on a Gaussian: μ ± σ, μ ± 2σ and μ ± 3σ contain 68%, 95% and 99.7% of the probability.


Chebyshev's inequality for the derived variable x̄

$$P\left(|\bar{x} - \mu| > k\frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{k^2} \tag{B.35}$$

Let $k\sigma/\sqrt{n} = \epsilon$ = some tolerance. Then

$$P(|\bar{x} - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \qquad P(|\bar{x} - \mu| < \epsilon) \ge 1 - \frac{\sigma^2}{n\epsilon^2} \tag{B.36}$$

This is the law of large numbers: for a given tolerance, as n → ∞, x̄ → μ.

Which estimate intervals do we want?

Chebyshev's inequality gives us a loose bound. We want better estimates. For a normal distribution:

1. interval estimate of μ with σ² known,
2. interval estimate of μ with σ² unknown,
3. interval estimate of σ² with μ unknown.

For a binomial distribution:

4. interval estimate of p.

1. Normal: interval for μ, σ² known

We have n normally distributed points.

$$x \sim N(\mu, \sigma^2) \qquad \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Recall the z notation for a threshold: $P(z > z_\alpha) = \alpha$.

Fig. B.5: One-sided interval. Fig. B.6: Two-sided intervals.

For a two-sided interval,

$$P\left(-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha \tag{B.37}$$

$$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \tag{B.38}$$

We can rearrange the inequalities in (B.37) to get

$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{B.39}$$

So what does a confidence interval $\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$ mean? The maximum error of the estimate is $|\bar{x} - \mu| < z_{\alpha/2}\sigma/\sqrt{n}$ with probability 1 − α. What does a confidence interval mean to a frequentist? If the experiment is repeated many times, a fraction 1 − α of the intervals so constructed will cover μ (Fig. B.7). The two-sided confidence interval at level 100(1 − α)% is

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{B.40}$$

Fig. B.7: Confidence intervals.

We can also have lower and upper one-sided intervals:

$$\left(\bar{x} - z_\alpha\frac{\sigma}{\sqrt{n}},\ \infty\right) \tag{B.41}$$


$$\left(-\infty,\ \bar{x} + z_\alpha\frac{\sigma}{\sqrt{n}}\right) \tag{B.42}$$

What should the sample size n be, if we desire x̄ to approach μ to within a desired level of confidence (i.e. given α)? Rearrange and solve for n:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{|\bar{x} - \mu|}\right)^2 \tag{B.43}$$

Does the dependency of n on the various terms make sense? You need more samples if

1. you want x̄ to come very close to μ,
2. σ is large,
3. z_{α/2} is increased (i.e. α is decreased).

Fig. B.8: Intervals on the Gaussian (error E = x̄ − μ; u = x̄ + z_{α/2}σ/√n, l = x̄ − z_{α/2}σ/√n). Fig. B.9: Effect of σ.
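A minimal sketch of (B.39) and (B.43) (SciPy assumed; the numbers are hypothetical):

```python
from scipy.stats import norm
import math

xbar, sigma, n, alpha = 52.3, 4.0, 25, 0.05
z = norm.ppf(1 - alpha / 2)                     # 1.96 for alpha = 0.05

half = z * sigma / math.sqrt(n)
print(xbar - half, xbar + half)                 # 95% two-sided interval for mu

E = 1.0                                         # desired max error |xbar - mu|
print(math.ceil((z * sigma / E) ** 2))          # required n, Eq. (B.43)
```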

2. Normal: interval for μ, σ² unknown

The two-sided interval for μ, when σ² is known, is $\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$. What happens when σ² is unavailable? We must estimate σ²; its point estimate is s². However, there is a problem w.r.t. interval estimation:

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \text{ follows the normal } N(0, 1), \quad \text{but} \quad \frac{\bar{x} - \mu}{s/\sqrt{n}} \text{ does not follow } N(0, 1)$$

In fact, we need a new sampling distribution to describe how $(\bar{x} - \mu)/(s/\sqrt{n})$ varies. We need to acknowledge that our estimate s² is a random variable (RV), with its own distribution centered around mean σ². Therefore we have a range of values for s² which could be used to substitute for σ² in N(μ, σ²), and therefore we must average across several Gaussians which have the same mean but different variances. The new distribution must have the following properties:

The Gaussians are symmetric about μ: therefore a weighted average of Gaussians must be symmetric about μ. Tail: it should have a thick tail for small n (greater variability than N). As n → ∞, we should be back to N(μ, σ²). The distribution is a function of sample size n: we have a family of curves!

(William Gosset, a chemist at the Guinness breweries in Dublin, came up with the Student's t-distribution!)

We know of N(x|μ, σ²). But σ² is unknown, and if we estimate it, we end up with a distribution of variance values. It is easier to work with the precision λ:

$$\lambda = \frac{1}{\sigma^2} \tag{B.44}$$

The univariate normal distribution is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] = \frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\lambda}{2}(x-\mu)^2\right] \tag{B.45}$$


We have a dependency on λ of the form

$$\frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\lambda}{2}(x-\mu)^2\right]$$

The Gamma distribution is

$$Gam(x|\alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1}}{\Gamma(\alpha)}\exp(-\beta x), \quad x > 0,\ \alpha > 0,\ \beta > 0 \tag{B.46}$$

where

$$\Gamma(\alpha) = \int_0^{\infty} e^{-y}y^{\alpha-1}\,dy \tag{B.47}$$

Γ(α) = (α − 1)Γ(α − 1). If α is an integer, Γ(n) = (n − 1)!, and Γ(1) = 1. (For α = 1, the Gamma distribution becomes an exponential distribution.) For a Gamma variable, the moment generating function is

$$\psi(t) = E\left[e^{tx}\right] = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^{\infty} e^{tx}e^{-\beta x}x^{\alpha-1}\,dx = \left(\frac{\beta}{\beta - t}\right)^\alpha \tag{B.48}$$

$$E[x] = \psi'(0) = \frac{\alpha}{\beta} \tag{B.49}$$

$$Var[x] = \psi''(0) - (E[x])^2 = \frac{\alpha}{\beta^2} \tag{B.50}$$

The precision follows a Gamma distribution. We had

Fig. B.10: Gamma distribution.

$$N(x|\mu, \sigma^2) = N(\mu, \lambda^{-1}) \tag{B.51}$$

We want the average of several such distributions:

$$E_\lambda\left[N(\mu, \lambda^{-1})\right] \tag{B.52}$$

which is

$$\int_0^{\infty} N(x|\mu, \lambda^{-1})\,Gam(\lambda|\alpha, \beta)\,d\lambda \tag{B.53}$$

This averaged distribution can be shown to be (details in another handout on Moodle)

$$\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}} \tag{B.54}$$

This is the Student's t-distribution, St(x|μ, λ, ν), where ν = 2α signifies the degrees of freedom.

Fig. B.11: The t distribution. Fig. B.12: The t and N(0, 1) distributions.

Notice the symmetry about 0 in Figs. B.11 and B.12. Quantiles corresponding to the 95% confidence interval for the t-distribution, as a function of the degrees of freedom ν, are shown in Table B.1. For n ≥ 30, t ≈ N(0, 1).

Fig. B.13: Intervals on the t distribution.

Table B.1:
ν | t_{0.025,ν}
4 | 2.776
9 | 2.262

Now

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \text{ follows the } N(0, 1) \text{ curve, but } \frac{\bar{x} - \mu}{s/\sqrt{n}} \text{ follows the } t_{n-1} \text{ curve}$$

The interval for μ, when σ² was known, was

$$\bar{x} \pm \frac{\sigma}{\sqrt{n}}\,z_{\alpha/2}$$

The interval for μ when σ² is unknown is

$$\bar{x} \pm \frac{s}{\sqrt{n}}\,t_{\alpha/2,\,n-1} \tag{B.55}$$

Similarly, lower and upper one-sided intervals are

$$\left(\bar{x} - t_{\alpha,n-1}\frac{s}{\sqrt{n}},\ \infty\right) \tag{B.56}$$

$$\left(-\infty,\ \bar{x} + t_{\alpha,n-1}\frac{s}{\sqrt{n}}\right) \tag{B.57}$$


Example B.3 A fuse manufacturer states that with an overload, fuses blow in 12.40 minutes on average. As a test, take 20 sample fuses and subject them to an overload. A mean time of 10.63 minutes and a standard deviation of 2.48 minutes are obtained. Is the manufacturer right?

Soln:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{10.63 - 12.40}{2.48/\sqrt{20}} = -3.19$$

From a t table,

$$P(t_{19} > 2.861) = 0.005 \Rightarrow P(t_{19} < -2.861) = 0.005$$

Hence the manufacturer is wrong and the mean time is probably (with at least 99% confidence) below 12.40 minutes.
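Example B.3 via the t-distribution in SciPy (a sketch; only summary statistics are available, so the statistic is computed directly rather than via scipy.stats.ttest_1samp):

```python
from scipy.stats import t
import math

xbar, s, n, mu0 = 10.63, 2.48, 20, 12.40
tstat = (xbar - mu0) / (s / math.sqrt(n))
print(tstat)                                    # -3.19

p_one_sided = t.cdf(tstat, df=n - 1)            # P(t_19 < -3.19)
print(p_one_sided)                              # ~0.002: reject at the 1% level
```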

3. Normal: interval for σ²

Let (x1, ..., xn) ∼ N(μ, σ²). The point estimate of σ² is s²:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \quad \text{and} \quad E[s^2] = \sigma^2 \tag{B.58}$$

We have

$$z_i = \frac{(x_i - \mu)}{\sigma} \sim N(0, 1) \tag{B.59}$$

Let

$$X = z_1^2 + \cdots + z_n^2 \tag{B.60}$$

X follows a χ² distribution with n degrees of freedom:

$$X \sim \chi^2_n \tag{B.61}$$

The χ² with n degrees of freedom is related to the Gamma distribution. The χ² distribution has x always > 0 and is skewed to the right. It is not symmetric; the skew → 0 as n → ∞.

Fig. B.14: The χ² distribution (k = 2, 5, 10).

We have

$$\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\sigma^2} \sim \chi^2_n \tag{B.62}$$

When we substitute μ with x̄,

$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1} \tag{B.63}$$

But

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \Rightarrow \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \Rightarrow s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1} \tag{B.64}$$

We compute intervals using quantiles, as before. There is no symmetry, so we need to evaluate both quantiles for a two-sided interval:

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \tag{B.65}$$

$$P\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}\right) = 1 - \alpha$$

B.3 Hypothesis Testing

Consider testing the mean of a normal population with known σ²:

H0: μ = μ0
H1: μ = μ1 (≠ μ0)

Do we need to pay attention to H1? Yes: H1 influences the interval estimate. To see that, we need to understand what β means. But before that, why should H0 be the equality statement? Because the moment we say something like H0: μ = μ0, we have a specific distribution to work with! The equality statement carries the burden of proof. The legal system prefers that we try to disprove H0: X is innocent. Typically, the accused is innocent (H0) until proven guilty (H1) beyond reasonable doubt (α).

Fig. B.16: Hypothesis acceptance.

What is α?

The error of the estimate is |x̄ − μ|. This maximum error is < $z_{\alpha/2}\sigma/\sqrt{n}$ with probability 1 − α (Fig. B.17: The error).

There are 4 possible outcomes to a test of H0 vs. H1:

1. Accept H0 as true, when in fact it is true.
2. Accept H0 as true, when in fact it is not true (i.e. H1 is true).
3. Reject H0 as true, when in fact H0 is true.
4. Reject H0 as true, when in fact H0 is not true (i.e. H1 is true).

Decision taken | H0 true              | H1 true
Accept H0      | correct              | Type II error (β)
Accept H1      | Type I error (α)     | correct

Type I error: H0 true, but H1 accepted ⇒ α. (Remember that α = significance level of the test applied. Where there's an α... there's a β.) Type II error: H1 true, but H0 accepted ⇒ β. Assuming H1: μ = μ1, with μ1 > μ0, 1 − β is the power of the test. β depends on μ1: increase |μ0 − μ1| and 1 − β increases ⇒ a more powerful test. (Fig. B.18: α and β.)

We accept H0 in the range

$$\left|\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right| \le z_{\alpha/2} \Leftrightarrow \mu_0 - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu_0 + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{B.85}$$

But β is the area in this range under H1. We need to convert the thresholds to standardized coordinates first, this time under H1. Under H1,

$$\bar{x} \sim N\left(\mu_1, \frac{\sigma^2}{n}\right) \tag{B.86}$$

The location $\mu_0 - z_{\alpha/2}\sigma/\sqrt{n}$ becomes (under H1)

$$\frac{\mu_0 - z_{\alpha/2}\sigma/\sqrt{n} - \mu_1}{\sigma/\sqrt{n}} = \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2} \tag{B.87}$$

(Fig. B.19: The acceptance interval.) Similarly, the other threshold is

$$\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2} \tag{B.88}$$

Therefore

$$\beta = P\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2} \le z \le \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) = \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) - \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2}\right) \tag{B.89}$$

If μ1 > μ0, then the second term → 0, so

$$\beta \approx \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right), \quad \text{and by definition } \beta = P(z > z_\beta) = P(z < -z_\beta) = \Phi(-z_\beta) \tag{B.90}$$

$$\Rightarrow -z_\beta = \frac{(\mu_0 - \mu_1)\sqrt{n}}{\sigma} + z_{\alpha/2} \tag{B.91}$$

$$\Rightarrow n = \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{(\mu_1 - \mu_0)^2} \tag{B.92}$$

The sample size depends on the following (a numerical sketch follows the figures below):

1. |μ0 − μ1|: as |μ0 − μ1| increases, n decreases. (Fig. B.20: Effect of |μ0 − μ1|, shown for H0: μ = 50 against H1: μ = 52 and against H1: μ = 50.5.)
2. σ²: as σ² increases, n increases. (Fig. B.21: Effect of σ².)
3. α: as α decreases, z_{α/2} increases, hence n increases. (Fig. B.22: Effect of α.)
4. β: as the required power increases (1 − β increases), n increases.


Fig. B.23: Effect of β.
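A sketch computing β from (B.89) and the required n from (B.92) (SciPy assumed; μ0, μ1, σ and the desired power are hypothetical):

```python
from scipy.stats import norm
import math

mu0, mu1, sigma, alpha = 50.0, 52.0, 4.0, 0.05
z_a2 = norm.ppf(1 - alpha / 2)

def beta(n):                                    # Eq. (B.89), two-sided test
    d = (mu0 - mu1) / (sigma / math.sqrt(n))
    return norm.cdf(d + z_a2) - norm.cdf(d - z_a2)

for n in (10, 20, 40):
    print(n, beta(n))                           # beta falls as n grows

power = 0.90                                    # desired 1 - beta
z_b = norm.ppf(power)
n_req = (z_a2 + z_b) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2
print(math.ceil(n_req))                         # Eq. (B.92): ~43
```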

One-sided vs. two-sided tests

Consider the null hypothesis H0: μ = μ0. The rejection regions depend on the alternative:

(a) H1: μ ≠ μ0 — critical regions in both tails, beyond ±z_{α/2} (each of area α/2);
(b) H1: μ > μ0 — critical region in the right tail, beyond z_α;
(c) H1: μ < μ0 — critical region in the left tail, beyond −z_α.

Fig. B.24: One-sided vs. two-sided tests. (Which is more conservative: a two-sided test or a one-sided test?)

When H0: μ = μ0 and H1: μ > μ0, we should

$$\text{accept } H_0 \text{ if } \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \le z_\alpha \tag{B.93}$$

$$\text{reject } H_0 \text{ if } \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha \tag{B.94}$$

p-value

The p-value is the significance level at which we would be exactly indifferent between accepting and rejecting H0. Find the probability of exceeding the test statistic (in absolute value) on a standard Normal:
\[ \text{p-value} = P\!\left( |z| > \frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}} \right) \tag{B.95} \]

[Fig. B.25: The p-value]

p-value and one-sided tests

For H0 : $\mu = \mu_0$ and H1 : $\mu \ne \mu_0$, we would reject if the test statistic t satisfies $|t| > z_{\alpha/2}$:
\[ \text{p-value} = 2\,P(z \ge |t|) \]
For H0 : $\mu \le \mu_0$ and H1 : $\mu > \mu_0$, we would reject if $t > z_\alpha$:
\[ \text{p-value} = P(z \ge t) \]
For H0 : $\mu \ge \mu_0$ and H1 : $\mu < \mu_0$, we would reject if $t < -z_\alpha$:
\[ \text{p-value} = P(z \le t) \]

Why is it important to report the p-value? Why not just report the test result (yes/no) based on the chosen α?

    When you perform a hypothesis test, you should report

    1. the significance level of the test that you chose,

    2. AND the p-value associated with your test statistic.
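In code, the two-sided p-value of Eq. B.95 is a single quantile call. The inputs below are those of Example B.5 further on, so the output can be checked against that worked solution:

```python
from math import sqrt
from scipy.stats import norm

# Inputs as in Example B.5 below: sample mean, hypothesized mean, known sigma, n
xbar, mu0, sigma, n = 1.2, 1.0, 0.4, 12

z = (xbar - mu0) / (sigma / sqrt(n))   # standardized test statistic
p = 2 * norm.sf(abs(z))                # Eq. B.95, two-sided
print(z, p)                            # ~1.732, p ~ 0.083: report both alpha and p
```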


A problem solving tip

For $\mu_1 > \mu_0$, and for a two-sided test,
\[ \beta \approx \Phi\!\left( \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2} \right) \tag{B.96} \]
Observe that
\[ \beta = f(|\mu_0 - \mu_1|,\, \sigma^2,\, n,\, \alpha) \tag{B.97} \]
The sample size n is
\[ n \ge \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{(\mu_1 - \mu_0)^2} \tag{B.98} \]
and depends on
\[ n = f(|\mu_0 - \mu_1|,\, \sigma^2,\, \alpha,\, \beta) \tag{B.99} \]
Most problems require us to solve for one of the following in terms of the rest:
\[ |\mu_0 - \mu_1|, \quad \sigma^2, \quad \alpha, \quad \beta, \quad n \]

β, power and rejection of H0

As we increase $\mu_1$, β decreases. The curve described by $\beta(\mu_1)$ is called an operating characteristic (OC). For a fixed value of α, plot
\[ \beta(\mu_1) \;\text{ vs. }\; \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}} \tag{B.100} \]

[Fig. B.26: The OC curve for a two-sided test with α = 0.05]
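The OC curve can be traced numerically from Eq. B.89. A minimal sketch, with illustrative values (μ0 = 50, σ = 4, n = 25 are assumptions, not from the text):

```python
from math import sqrt
from scipy.stats import norm

def beta_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Type II error probability of Eq. B.89 for a two-sided z-test."""
    z = norm.ppf(1 - alpha / 2)
    shift = (mu0 - mu1) / (sigma / sqrt(n))
    return norm.cdf(shift + z) - norm.cdf(shift - z)

# beta falls (power rises) as mu1 moves away from mu0 = 50
for mu1 in (50.5, 51.0, 52.0, 53.0):
    print(mu1, round(beta_two_sided(50.0, mu1, 4.0, 25), 3))
```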

One sided tests and β

We had, when H0 : $\mu = \mu_0$ and H1 : $\mu > \mu_0$,
\[ \beta(\mu_1) = P(\text{accepting } H_0 \mid H_1 : \mu = \mu_1 \text{ is correct}) \tag{B.101} \]
We switch to the more general notation, replacing $\mu_1$ with $\mu$:
\[ \beta(\mu) = P(\text{accepting } H_0 \mid H_1 \text{ is correct}) \tag{B.102} \]

[Fig. B.27: Finding β]

\[
\begin{aligned}
\beta(\mu) &= P\!\left( \bar{x} \le \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} \right)
= P\!\left( \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha \right) \\
&= P\!\left( z \le \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha \right), \quad z \sim N(0,1) \\
&= \Phi\!\left( \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha \right)
\end{aligned} \tag{B.103}
\]

Compare this against β for a two-sided interval: $z_{\alpha/2}$ has been replaced with $z_\alpha$, and $\mu_1$ has been replaced with the more generic $\mu$. As its argument increases, Φ increases; therefore as $\mu$ increases, $\beta(\mu)$ decreases. The larger $\mu$ is, the less likely it would be to conclude that $\mu = \mu_0$, or for that matter $\mu \le \mu_0$. Note that
\[ \beta(\mu_0) = 1 - \alpha \]


H0 and H1

To test H0 : $\mu = \mu_0$ and H1 : $\mu > \mu_0$, we compared
\[ \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \;\text{ vs. }\; z_\alpha \]
If instead we wished to test H0 : $\mu \le \mu_0$ and H1 : $\mu > \mu_0$, what changes must we make to the test procedure? How did H0 : $\mu = \mu_0$ influence the testing procedure?

We put down the H0 curve centered around $\mu_0$, and then chose α and identified an interval based on it. We meant that the probability of rejection of H0 should be (the small valued) α. As we relax H0 to now be $\mu \le \mu_0$, we need to ensure that the probability of rejection of H0 is never greater than α. But the probability of accepting H0 for a given $\mu$ is $\beta(\mu)$. Therefore, for the probability of rejection to be never greater than α, we would need
\[ 1 - \beta(\mu) \le \alpha \quad \text{for all } \mu \le \mu_0 \]
which implies that we would need
\[ \beta(\mu) \ge 1 - \alpha \quad \text{for all } \mu \le \mu_0 \]
But, as seen before, $\beta(\mu)$ decreases as $\mu$ increases. Therefore $\beta(\mu) \ge \beta(\mu_0)$. But $\beta(\mu_0) = 1 - \alpha$,

[Fig. B.28: α and β]

and therefore we confirm that
\[ \beta(\mu) \ge 1 - \alpha \quad \text{for all } \mu \le \mu_0 \]
Therefore, a test for H0 : $\mu = \mu_0$ at level α also works for H0 : $\mu \le \mu_0$ at level α.

    B.4 Parametric hypothesis testing

    Overview of Parametric Tests discussed

1. Testing the mean of a Normal population, variance known (z-Test)
2. Testing the mean of a Normal population, variance unknown (t-Test)
3. Testing the variance of a Normal population ($\chi^2$-Test)
4. Testing the Binomial proportion
5. Testing the equality of means of two Normal populations, variances known
6. Testing the equality of means of two Normal populations, variances unknown but assumed equal
7. Testing the equality of means of two Normal populations, variances unknown and found to be unequal
8. Testing the equality of variances of two Normal populations (F-Test)
9. Testing the equality of means of two Normal populations, the paired t-Test.

    The approach for any Parametric Test

    Step 1. State the hypothesis.

    a. Identify the random variable and the relevant parameter to be tested.

    b. Clearly state H0 and H1.

    c. Identify the distribution that applies under H0.

Step 2. Identify the significance level (α) at this stage.
Step 3. Based on H1 and α, identify the critical threshold(s).


a. The relevant (standardized) sampling distribution needs to be identified, knowing the sample size/degrees of freedom.
b. One-sided or two-sided test?
Step 4. Compute the test statistic from the sampled data.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

1. The z-Test: One sample test for μ when the variance is known

Step 1. State the hypothesis.
H0 : $\mu = \mu_0$, H1 : $\mu \ne \mu_0$. Under H0, $\bar{x} \sim N(\mu_0, \sigma^2/n)$.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s).
At α = 0.05, on the Standard Normal, for a two-sided test, $z_{\alpha/2} = 1.96$.
Step 4. Compute the test statistic from the sampled data:
\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \;\text{ vs. }\; N(0,1) \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

2. The t-Test: One sample test for μ when the variance is unknown

Step 1. State the hypothesis.
H0 : $\mu = \mu_0$, H1 : $\mu \ne \mu_0$. Under H0, $\bar{x} \sim N(\mu_0, \sigma^2/n)$, but $\sigma^2$ is unknown: standardize with s and use the $t_{n-1}$ curve.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s): use $t_{\alpha/2,n-1}$ on the $t_{n-1}$ curve.
Step 4. Compute the test statistic from the sampled data:
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \;\text{ vs. }\; t_{n-1} \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

Example B.4 12 rats are subjected to forced exercise. Each weight change (in g) is the weight after exercise minus the weight before. Does exercise cause weight loss?
1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0


Soln:
Step 1. H0 : $\mu = 0$, H1 : $\mu \ne 0$.
Step 2. Use α = 0.05.
Step 3. $t_{0.025,11} = 2.201$.
Step 4. n = 12, $\bar{x} = -0.65$ g and $s^2 = 1.5682$ g².
\[ t = \frac{\bar{x} - 0}{s/\sqrt{n}} = -1.80 \]
Step 5. Since $|t| < t_{\alpha/2,n-1}$, H0 cannot be rejected.
Step 6. p-value = ?
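As a cross-check (and to answer Step 6), the same one-sample t-test can be run with scipy on the Example B.4 data:

```python
from scipy import stats

# Weight changes (g) for the 12 rats of Example B.4
d = [1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0]

res = stats.ttest_1samp(d, popmean=0.0)   # two-sided one-sample t-test
print(res.statistic, res.pvalue)          # t ~ -1.80 on 11 df, p just under 0.10
```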

Example B.5 The level of metabolite X measured in 12 patients 24 hours after they received a newly proposed antibiotic was 1.2 mg/dL. If the mean and standard deviation of X in the general population are 1.0 and 0.4 mg/dL, respectively, then, using a significance level of 0.05, test if the level of X in this group is different from that of the general population. What happens if α = 0.1?

Soln:
Step 1. H0 : $\mu = 1.0$, H1 : $\mu \ne 1.0$.
Step 2. Use α = 0.05.
Step 3. $z_{0.025} = 1.96$.
Step 4. n = 12, $\bar{x} = 1.2$, σ = 0.4.
\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = 1.732 \;\text{ vs. }\; 1.96 \]
Step 5. H0 cannot be rejected at α = 0.05.
Step 6. p-value = $2\,P(z > 1.732) = 2 \times 0.0416 = 0.0832$.
What happens at α = 0.1? Since the p-value 0.0832 < 0.1, H0 would then be rejected.

Example B.6 (Example B.5 contd.) Suppose the sample standard deviation of metabolite X is 0.6 mg/dL. Assume that the (population) standard deviation of X is not known, and perform the hypothesis test.

Soln:
Step 1. H0 : $\mu = 1.0$, H1 : $\mu \ne 1.0$.
Step 2. Use α = 0.05.
Step 3. $t_{0.025,11} = 2.201$.
Step 4. n = 12, $\bar{x} = 1.2$, s = 0.6.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 1.155 \;\text{ vs. }\; 2.201 \]
Step 5. Cannot reject H0.
Step 6. $P(t < 1.155)$ on a $t_{11}$ curve is 0.8637, so
p = 2 × (1 − 0.8637) = 0.2726.

Example B.7 The nationwide average score in an IQ test = 120. In a poor area, 100 student scores are analyzed: the mean is 115 and the standard deviation is 24. Is the area average lower than the national average?


Soln:
Step 1. H0 : $\mu = 120$ vs. H1 : $\mu < 120$.
Step 2. Use α = 0.05.
Step 3. $t_{0.05,99} = 1.66$; $z_{0.05} = 1.645$. Use either value. Since we have H1 : $\mu < 120$, our critical level is -1.66.
Step 4.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{115 - 120}{24/\sqrt{100}} = -2.08 \;\text{ vs. }\; -1.66 \]
Step 5. We reject H0 at a significance level of 0.05.
Step 6. The p-value is $P(t_{99} < -2.08) = 0.02$. The result is statistically significant. We could not have rejected H0 at significance level 0.01.

Example B.8 (Example B.7 contd.) Suppose 10,000 scores were measured with mean = 119 and s = 24. Redo the test.

Soln:
Step 1. H0 : $\mu = 120$ vs. H1 : $\mu < 120$.
Step 2. Use α = 0.05.
Step 3. $t_{0.05,9999} \approx z_{0.05} = 1.645$. Since we have H1 : $\mu < 120$, our critical level is -1.645.
Step 4.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{119 - 120}{24/\sqrt{10000}} = -4.17 \;\text{ vs. }\; -1.645 \]
Step 5. We reject H0 at a significance level of 0.05.
Step 6. The p-value is $P(t_{9999} < -4.17) \approx P(z < -4.17) < 0.001$. The result is highly statistically significant, but probably not scientifically significant: the detected difference is only 1 IQ point.

Example B.9 A pilot study of a new BP drug is performed for the purpose of planning a larger study. Five patients who have a mean BP of at least 95 mm Hg are recruited for the study and are kept on the drug for 1 month. After 1 month, the observed mean decline in BP in these 5 patients is 4.8 mm Hg, with a standard deviation of 9 mm Hg.
If $\mu_d$ = true mean difference in BP between baseline and 1 month, then how many patients would be needed to have a 90% chance of detecting a significant difference using a one-tailed test with a significance level of 5%?

Example B.10 (Example B.9 contd.) We go ahead and study the preceding hypothesis based on 20 people (we could not find 31). What is the probability that we will be able to reject H0 using a one-sided test at the 5% level, if the true mean and standard deviation of the BP difference are the same as in the pilot study?
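Both questions can be worked numerically with the one-sided analogues of the sample-size and power formulas above ($z_\alpha$ in place of $z_{\alpha/2}$). A sketch using the pilot-study numbers:

```python
from math import ceil, sqrt
from scipy.stats import norm

delta, sigma = 4.8, 9.0     # pilot mean decline and sd of the BP difference
alpha = 0.05                # one-tailed test

z_a = norm.ppf(1 - alpha)   # z_alpha for a one-sided test
z_b = norm.ppf(0.90)        # z_beta for 90% power

n = ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)
print(n)                    # Example B.9: 31 patients

power_20 = norm.cdf(delta * sqrt(20) / sigma - z_a)
print(power_20)             # Example B.10: ~0.77 with only 20 patients
```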


3. The $\chi^2$-Test: One sample test for the variance of a Normal population

Step 1. State the hypothesis.
H0 : $\sigma^2 = \sigma_0^2$, H1 : $\sigma^2 \ne \sigma_0^2$. Standardize $s^2$ and use the $\chi^2_{n-1}$ curve. We need both thresholds: $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $\chi^2_{n-1}$ curve.
Step 4. Compute the test statistic from the sampled data:
\[ \frac{(n-1)s^2}{\sigma_0^2} \;\text{ vs. }\; \chi^2_{n-1} \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

Example B.11 We desire to estimate the concentration (μg/mL) of a specific dose of ampicillin in the urine after various periods of time. We recruit 25 volunteers who have received ampicillin and find that they have a mean concentration of 7.0 μg/mL, with a standard deviation of 2.0 μg/mL. Assume that the underlying distribution of concentrations is normally distributed.
1. Find a 95% confidence interval for the population mean concentration.
Soln: (6.17, 7.83)
2. Find a 99% confidence interval for the population variance of the concentrations.
Soln: (2.11, 9.71)
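Both intervals come from a few quantile lookups; a minimal sketch reproducing the Example B.11 answers with the standard $t$ and $\chi^2$ interval formulas:

```python
from math import sqrt
from scipy.stats import chi2, t

n, xbar, s = 25, 7.0, 2.0

# 95% CI for the mean: xbar +/- t_{0.025,24} * s / sqrt(n)
half = t.ppf(0.975, n - 1) * s / sqrt(n)
print(xbar - half, xbar + half)                  # ~(6.17, 7.83)

# 99% CI for the variance: (n-1)s^2 divided by the chi-square quantiles
print((n - 1) * s**2 / chi2.ppf(0.995, n - 1),
      (n - 1) * s**2 / chi2.ppf(0.005, n - 1))   # ~(2.11, 9.71)
```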

Example B.12 (Example B.11 contd.) How large a sample would be needed to ensure that the length of the confidence interval in part (a) (for the mean) is 0.5 μg/mL, if we assume that the standard deviation remains at 2.0 μg/mL?
Soln: n = 246

Example B.13 Iron deficiency causes anemia. 21 students from families below the poverty level were tested, and the mean daily iron intake was found to be 12.50 mg with standard deviation 4.75 mg. For the general population, the standard deviation is 5.56 mg. Are the two groups comparable?
Soln: The statistic $(n-1)s^2/\sigma_0^2 = 20 \times 4.75^2/5.56^2 = 14.6$ lies within (9.591, 34.170), and hence we cannot reject H0. Find the p-value yourself.


4. A Test for the Binomial Proportion

One sample test for p, assuming that the Normal approximation to the Binomial is valid.
Step 1. State the hypothesis.
H0 : $p = p_0$, H1 : $p \ne p_0$. Under H0, $x \sim N(np_0,\, np_0(1 - p_0))$. Standardize and use the N(0,1) curve.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the z curve.
Step 4. Compute the test statistic from the sampled data:
\[ z = \frac{x - np_0}{\sqrt{np_0(1 - p_0)}} \;\text{ vs. }\; z_{\alpha/2} \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
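A minimal sketch of this proportion test; the counts are hypothetical, chosen only for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical data: x successes in n trials, testing p0 = 0.5
x, n, p0 = 60, 100, 0.5

z = (x - n * p0) / sqrt(n * p0 * (1 - p0))   # Normal approximation to the Binomial
print(z, 2 * norm.sf(abs(z)))                # z = 2.0, p ~ 0.046
```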

5. A Test for the equality of means of two Normal populations, when the variances are known

Consider measurements $x_i$ and $y_i$ arising out of two different populations. We assume that we have n measurements for $x_i$ and m for $y_i$:
\[ x_i \sim N(\mu_x, \sigma_x^2), \qquad y_i \sim N(\mu_y, \sigma_y^2) \tag{B.104} \]
Then the point estimates of the means for the two populations are
\[ \bar{x} \sim N\!\left(\mu_x, \frac{\sigma_x^2}{n}\right), \qquad \bar{y} \sim N\!\left(\mu_y, \frac{\sigma_y^2}{m}\right) \tag{B.105} \]
This therefore implies that (remember that variances of independent variables add up)
\[ \bar{x} - \bar{y} \sim N\!\left(\mu_x - \mu_y,\; \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}\right) \tag{B.106} \]
Step 1. State the hypothesis.
H0 : $\mu_x - \mu_y = 0$, H1 : $\mu_x - \mu_y \ne 0$. Under H0, $\bar{x} - \bar{y}$ follows Eq. B.106. Standardize and use the N(0,1) curve.

Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s).
At α = 0.05, on the Standard Normal, for a two-sided test, $z_{\alpha/2} = 1.96$.
Step 4. Compute the test statistic from the sampled data:
\[ z = \frac{(\bar{x} - \bar{y}) - 0}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}} \;\text{ vs. }\; N(0,1) \]

If $\sigma_x^2 = \sigma_y^2 = \sigma^2$ (as would be expected when the sample sets come from the same population under H0), then
\[ \bar{x} - \bar{y} \sim N\!\left(\mu_x - \mu_y,\; \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)\right) \tag{B.107} \]


The difficulty that remains is identifying the degrees of freedom of the t distribution that needs to be looked up. The recommended approach is to find the degrees of freedom d′ as follows:
\[ d' = \frac{\left( \dfrac{S_x^2}{n} + \dfrac{S_y^2}{m} \right)^2}{\dfrac{(S_x^2/n)^2}{n-1} + \dfrac{(S_y^2/m)^2}{m-1}} \tag{B.112} \]
\[ d'' = d' \text{ rounded down} \tag{B.113} \]

Step 1. State the hypothesis.
H0 : $\mu_x - \mu_y = 0$, H1 : $\mu_x - \mu_y \ne 0$. Under H0, $\bar{x} - \bar{y}$ follows Eq. B.106. Standardize and use the $t_{d''}$ curve.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $t_{d''}$ curve.
Step 4. Compute the test statistic from the sampled data using Eq. B.111.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
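A sketch of the unequal-variance procedure on invented data; scipy's ttest_ind with equal_var=False applies the same Eq. B.112 correction (without the rounding down):

```python
import numpy as np
from scipy import stats

# Invented samples, for illustration only
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])
y = np.array([4.2, 4.5, 3.9, 4.8, 4.1])

# Welch degrees of freedom d', Eq. B.112 (round down per Eq. B.113)
vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
d = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
print(int(d))                                   # degrees of freedom to look up

print(stats.ttest_ind(x, y, equal_var=False))   # Welch's t-test
```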

8. A Test for the equality of variances of two Normal populations (F-test)

When the population variances for the two Normal populations are unknown, we have (in the previous two tests) developed tests for equality of the means, assuming that the variances are either equal (Case 6) or unequal (Case 7). What we should have been doing, before testing for the equality of the means, was to first ask whether the variances could be assumed equal, especially considering that the methodology in Case 6 is simpler than that in Case 7. From the discussion on the $\chi^2$ distribution, the ratio of two $\chi^2$ distributions (with, in general, different degrees of freedom) is a new distribution, the F distribution, dependent on the two sets of degrees of freedom:
\[ \frac{(n-1)S_x^2}{\sigma_x^2} \sim \chi^2_{n-1}, \qquad \frac{(m-1)S_y^2}{\sigma_y^2} \sim \chi^2_{m-1} \tag{B.114} \]
\[ \frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim \frac{\chi^2_{n-1}/(n-1)}{\chi^2_{m-1}/(m-1)} = F_{n-1,m-1} \tag{B.115} \]
Under H0 : $\sigma_x^2 = \sigma_y^2$, we get
\[ \frac{S_x^2}{S_y^2} \sim F_{n-1,m-1} \tag{B.116} \]

Then H0 cannot be rejected if (note that the outcome of the F-test should not depend on which data set is in the numerator of Eq. B.116)
\[ F_{1-\alpha/2,\,n-1,\,m-1} \le \frac{S_x^2}{S_y^2} \le F_{\alpha/2,\,n-1,\,m-1} \tag{B.117} \]
The choice of which set is labeled x and y is not important, given the following equivalence between two F distributions:
\[ P\!\left( \frac{\chi^2_{n-1}/(n-1)}{\chi^2_{m-1}/(m-1)} > F_{\alpha,n-1,m-1} \right)
= P\!\left( \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)} < \frac{1}{F_{\alpha,n-1,m-1}} \right) = \alpha
\;\Longrightarrow\;
P\!\left( \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)} > \frac{1}{F_{\alpha,n-1,m-1}} \right) = 1 - \alpha
= P\!\left( \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)} > F_{1-\alpha,m-1,n-1} \right) \tag{B.119} \]


which shows that
\[ \frac{1}{F_{\alpha,n-1,m-1}} = F_{1-\alpha,m-1,n-1} \tag{B.120} \]

The F-test is summarized as follows:
Step 1. State the hypothesis.
H0 : $\sigma_x^2 = \sigma_y^2$, H1 : $\sigma_x^2 \ne \sigma_y^2$. Under H0, we expect Eq. B.116.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $F_{n-1,m-1}$ curve.
Step 4. Compute the test statistic, $S_x^2/S_y^2$, from the sampled data.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
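A minimal sketch of the F-test on invented data (scipy has no one-call variance-ratio test, so the statistic and thresholds are computed directly):

```python
import numpy as np
from scipy.stats import f

# Invented samples, for illustration only
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])
y = np.array([4.2, 4.5, 3.9, 4.8, 4.1])
n, m = len(x), len(y)

F = x.var(ddof=1) / y.var(ddof=1)   # Eq. B.116
lo = f.ppf(0.025, n - 1, m - 1)     # F_{1-alpha/2,n-1,m-1}
hi = f.ppf(0.975, n - 1, m - 1)     # F_{alpha/2,n-1,m-1}
print(F, (lo, hi))                  # cannot reject H0 if lo <= F <= hi
```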

Sample sizes for two sample tests of the means

Consider H0 : $\mu_x = \mu_y$ and H1 : $\mu_x \ne \mu_y$. Under H0, the lower critical threshold is (in standardized coordinates) $-z_{\alpha/2}$. In the original coordinate system, relative to the H0 curve, this threshold is
\[ 0 - z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}} = -z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}} \]
But this coordinate, relative to the H1 curve, is known to be (after standardization) $z_\beta$:
\[ z_\beta = \frac{-z_{\alpha/2}\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}} \tag{B.121} \]
This can be rearranged to get, using k = m/n,
\[ \left( \frac{z_{\alpha/2} + z_\beta}{\mu_x - \mu_y} \right)^2 = \frac{n}{\sigma_x^2 + \sigma_y^2/k} \tag{B.122} \]
and hence, given α and β, the sample sizes required to differentiate between the two populations are
\[ n = \left( \frac{z_{\alpha/2} + z_\beta}{\mu_x - \mu_y} \right)^2 \left( \sigma_x^2 + \frac{\sigma_y^2}{k} \right) \tag{B.123} \]
\[ m = nk \tag{B.124} \]
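A sketch of Eqs. B.123-B.124; the numbers are assumptions for illustration, with equal allocation k = 1:

```python
from math import ceil
from scipy.stats import norm

def two_sample_sizes(delta, var_x, var_y, alpha=0.05, power=0.90, k=1.0):
    """Eqs. B.123-B.124: n for group x and m = nk for group y; delta = mu_x - mu_y."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)          # z_beta, with beta = 1 - power
    n = ((z_a + z_b) / delta) ** 2 * (var_x + var_y / k)
    return ceil(n), ceil(n * k)

# Assumed: detect a difference of 2.0 when both variances are 16
print(two_sample_sizes(2.0, 16.0, 16.0))   # (85, 85)
```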

9. A Test for the equality of means of two Normal populations, when the samples are paired (the paired t-test)

Consider n pairs of before and after measurements, which reflect, for example, the efficacy of some treatment. Then the n differences $d_i$ can be used to compute the mean sample difference on treatment ($\bar{d}$), and the sample standard deviation $sd(d_i)$. We have one vector of difference values, with the average difference expected to be zero according to H0. Since the population variance is unknown, we have a t test with n - 1 degrees of freedom:
\[ \frac{\bar{d} - E[\bar{d}]}{sd(\bar{d})} = \frac{\bar{d} - 0}{sd(d_i)/\sqrt{n}} \sim t_{n-1} \tag{B.125} \]

Then, the paired t-test may be summarized as follows.
Step 1. State the hypothesis.
H0 : $E[\bar{d}] = \mu = 0$, H1 : $\mu \ne 0$.


Under H0, we expect Eq. B.125 to hold.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $t_{n-1}$ curve.
Step 4. Compute the test statistic, $\sqrt{n}\,\bar{d}/sd(d_i)$, from the sampled data.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
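A sketch on invented before/after data; per Eq. B.125, the paired t-test is exactly a one-sample t-test on the differences, which the last line confirms:

```python
import numpy as np
from scipy import stats

# Invented before/after measurements on the same six subjects
before = np.array([96.0, 98.5, 101.2, 95.4, 99.0, 97.3])
after = np.array([92.1, 97.0, 99.5, 94.8, 95.2, 96.0])

print(stats.ttest_rel(after, before))           # paired t-test
print(stats.ttest_1samp(after - before, 0.0))   # identical, per Eq. B.125
```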

B.5 Nonparametric hypothesis testing

When the data available to us does not in any fashion obviously follow some distribution, or when the data is ordinal in nature rather than cardinal, the parametric hypothesis tests previously described cannot be applied. We then resort to nonparametric hypothesis tests. Two essential features of these tests are that we pay attention to the median of a data set, rather than the arithmetic mean, and that we endeavor to identify a statistic, like a proportion or a rank, which can be expected to follow a parametric distribution. In this manner, we drag things back towards the more convenient parametric hypothesis tests. We review below three nonparametric tests, which will be seen to be counterparts of some of the two sample parametric tests already discussed:
1. The Wilcoxon sign test.
2. The Wilcoxon signed rank test.
3. The Wilcoxon rank sum test.

The Wilcoxon sign test

(The paired t-test is applied under the assumption that all measurements follow some Normal distribution. If the underlying distribution is not Normal, or if it is unknown, then this test is a safer test.)

The Wilcoxon sign test is the nonparametric counterpart to the paired t test, and is usually invoked to evaluate whether two data sets are similar. Let $d_i$ indicate the relative like or dislike for two categories (icecreams). Then we have 3 categories of data: $d_i < 0$ where flavor A is preferred, $d_i > 0$ with B being preferred, and $d_i = 0$ with equal preference for both flavors. We wish to determine whether there is a marked preference for one flavor versus the other. H0 would suggest an equal number of people preferring the two icecreams. The actual numbers in the survey voting for the two icecreams should turn out to be close. Note that those who cannot make up their minds, with $d_i = 0$, are irrelevant to the test of H0. Therefore, if n scores have $d_i \ne 0$, of which c have $d_i > 0$, the test itself will be to see if c is close enough to the median of the data set, n/2. If $\Delta = E[c]$ is the population median of the underlying distribution of $d_i$, then
\[ H_0 : E[c] = \Delta = \frac{n}{2}, \qquad H_1 : \Delta \ne \frac{n}{2} \]
Under H0, the binomial proportion of finding nonzero $d_i > 0$ is 1/2. Further, the Binomial-Normal approximation requires $\mathrm{Var}[c] = np(1-p) \ge 10$; under H0, since p = 1/2, we get $n \ge 40$. Then we reject H0 if
\[ \left| \frac{c - E[c]}{sd(c)} \right| = \left| \frac{c - n/2}{\sqrt{n/4}} \right| > z_{\alpha/2} \tag{B.126} \]


A continuity correction is required because we are shifting over from a discrete to a continuous distribution. H0 is rejected:
\[ \text{if } c > \frac{n}{2}: \qquad \frac{c - \frac{n}{2} - \frac{1}{2}}{\sqrt{n/4}} > z_{\alpha/2} \]
\[ \text{if } c < \frac{n}{2}: \qquad \frac{c - \frac{n}{2} + \frac{1}{2}}{\sqrt{n/4}} < -z_{\alpha/2} \tag{B.127} \]

Example B.14 2 flavours of icecream undergo a taste test. 45 people are asked to choose, and 22 prefer A, 18 prefer B, and 5 cannot make up their minds. Is one flavour preferred over the other?
Soln:
n = 40, c = 18, and the statistic is
\[ \frac{(18 - 20) + (1/2)}{\sqrt{40/4}} = \frac{-1.5}{\sqrt{10}} = -0.47 \]
The p-value is $2(1 - \Phi(0.47)) = 2(1 - 0.6808) \approx 0.64$. Hence H0 cannot be rejected according to this test.
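A sketch reproducing Example B.14, plus the exact Binomial version that avoids the Normal approximation altogether (binomtest is available in scipy >= 1.7):

```python
from math import sqrt
from scipy.stats import binomtest, norm

n, c = 40, 18                          # nonzero preferences; c prefer flavour B

z = (c - n / 2 + 0.5) / sqrt(n / 4)    # continuity-corrected, c < n/2 case
print(z, 2 * norm.sf(abs(z)))          # ~ -0.47, p ~ 0.64

print(binomtest(c, n, p=0.5).pvalue)   # exact sign test, same conclusion
```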

The Wilcoxon signed rank test

Rather than simply resorting to the proportions employed in the sign test, we bring more detail into the hypothesis test by ordering the measurements and assigning ranks to them. We then expect the ranks to be equally distributed between corresponding quantile groups on either side of the median. This therefore ends up testing whether the distribution function has similar shapes on either side of the median, as would be expected under H0, where the two groups are supposed to be identical. The procedure is as follows:
1. Arrange the $d_i$s in order of absolute value and get frequency counts.
2. As in the sign test, ignore the $d_i = 0$ observations.
3. Rank the remaining observations from 1 to n.
4. Let $R_1$ be the sum of ranks for group 1.
5. H0 assumes that the ranks are equally shared between the two groups, and H1 assumes otherwise. Since n ranks are given out, the sum of all ranks is n(n + 1)/2, and hence the rank sum for group 1 should be n(n + 1)/4:
\[ E[R_1] = \frac{n(n+1)}{4}, \qquad \mathrm{Var}[R_1] = \frac{n(n+1)(2n+1)}{24} \tag{B.128} \]
6. For n ≥ 16, we assume that the rank sum follows a Normal distribution:
\[ R_1 \sim N\!\left( \frac{n(n+1)}{4},\; \frac{n(n+1)(2n+1)}{24} \right) \tag{B.129} \]
7. The test statistic T, bringing in the continuity correction again, is
\[ T = \frac{\left| R_1 - \frac{n(n+1)}{4} \right| - \frac{1}{2}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \tag{B.130} \]


8. A further correction factor is required for the variance if there are subgroups with tied ranks; the variance calculated above will be an overestimate. If g indicates the number of tied groups, and $t_i$ the number of tied values in group i, then the corrected variance is
\[ \mathrm{Var}[R_1] = \frac{n(n+1)(2n+1)}{24} - \sum_{i=1}^{g} \frac{t_i^3 - t_i}{48} \tag{B.131} \]
Equation B.130 needs to be correspondingly updated.

Example B.15 (Example B.14 contd.) The 45 volunteers rate each icecream on a scale from 1 to 10 (10 = really, really like it; 1 = don't care). Let $d_i$ = score(B) - score(A), so that $d_i > 0$ means B is preferred. The frequency counts are:

|di|   fi (di < 0)   fi (di > 0)
 10         0             0
  9         0             0
  8         1             0
  7         3             0
  6         2             0
  5         2             0
  4         1             0
  3         5             2
  2         4             6
  1         4            10

Soln:

|di|   fi (di < 0)   fi (di > 0)   Total # of people   Range of ranks   Average rank
 10         0             0                0
  9         0             0                0
  8         1             0                1                  40               40
  7         3             0                3                37-39               38
  6         2             0                2                35-36             35.5
  5         2             0                2                33-34             33.5
  4         1             0                1                  32               32
  3         5             2                7                25-31             28.0
  2         4             6               10                15-24             19.5
  1         4            10               14                 1-14              7.5

We conveniently choose $R_1$ to be the group corresponding to $d_i > 0$, since it has fewer nonzero frequencies (the arithmetic becomes simpler).
\[ R_1 = 10(7.5) + 6(19.5) + 2(28.0) = 248 \]
\[ E[R_1] = \frac{40 \times 41}{4} = 410 \]
\[ \mathrm{Var}[R_1] = \frac{40 \times 41 \times 81}{24} - \frac{(14^3 - 14) + (10^3 - 10) + (7^3 - 7) + (3^3 - 3) + 2(2^3 - 2)}{48} = 5449.75 \]
\[ sd(R_1) = \sqrt{5449.75} = 73.82 \]
\[ T = \frac{|248 - 410| - (1/2)}{73.82} = 2.19 \]

The p-value is $2(1 - \Phi(2.19)) = 2(1 - 0.9857) \approx 0.03$, and hence H0 is rejected (likely because flavor A is preferred). Notice that by including the quantile level information, the signed rank test gives a different conclusion to that of the sign test. Since this test has involved more detailed analysis, the conclusion here is preferable to that in Example B.14. Observe, from the method employed, that this test is the nonparametric equivalent of the paired t test.
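The rank bookkeeping is easy to get wrong by hand; a short sketch reproducing the Example B.15 numbers directly from the frequency table:

```python
from math import sqrt
from scipy.stats import norm

# freq[|d|] = (count with d < 0, count with d > 0), from the Example B.15 table
freq = {1: (4, 10), 2: (4, 6), 3: (5, 2), 4: (1, 0), 5: (2, 0),
        6: (2, 0), 7: (3, 0), 8: (1, 0)}

n = sum(a + b for a, b in freq.values())
rank_used, R1, tie_sum = 0, 0.0, 0
for absd in sorted(freq):
    a, b = freq[absd]
    t = a + b                           # size of this tied group
    avg_rank = rank_used + (t + 1) / 2  # average rank within the group
    R1 += b * avg_rank                  # rank sum of the d > 0 group
    tie_sum += t**3 - t
    rank_used += t

var = n * (n + 1) * (2 * n + 1) / 24 - tie_sum / 48   # Eq. B.131
T = (abs(R1 - n * (n + 1) / 4) - 0.5) / sqrt(var)     # Eq. B.130
print(R1, var, T, 2 * norm.sf(T))   # 248, 5449.75, ~2.19, p ~ 0.029
```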


The Wilcoxon rank sum test (aka the Mann-Whitney test, aka the U test)

The rank sum test is the nonparametric equivalent of the t-test for 2 independent samples. The procedure involves assigning ranks after combining the data from the two groups. As with the signed rank test, the rank sum for a group is compared with the expected rank sum. Let $R_1$ represent the rank sum in the first group, and let there be $n_1$ and $n_2$ samples in the two groups. Then, the sum of all the ranks is $(n_1 + n_2)(n_1 + n_2 + 1)/2$, and hence the average rank for the combined sample is $(n_1 + n_2 + 1)/2$. Under H0,
\[ E[R_1] = n_1\,\frac{n_1 + n_2 + 1}{2}, \qquad \mathrm{Var}[R_1] = n_1 n_2\,\frac{n_1 + n_2 + 1}{12} \tag{B.132} \]
If $n_1$ and $n_2$ are ≥ 10, the rank sum may be assumed to have an underlying continuous (Normal) distribution:
\[ R_1 \sim N\!\left( \frac{n_1(n_1 + n_2 + 1)}{2},\; \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} \right) \tag{B.133} \]

and hence the appropriate test statistic is
\[ T = \frac{\left| R_1 - \frac{n_1(n_1 + n_2 + 1)}{2} \right| - \frac{1}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \tag{B.134} \]
Once again a correction factor needs to be applied for groups of tied values. If g indicates the number of tied groups, and $t_i$ the number of tied values in group i, then the corrected variance is
\[ \mathrm{Var}[R_1] = \frac{n_1 n_2 (n_1 + n_2 + 1 - S)}{12}, \qquad S = \sum_{i=1}^{g} \frac{t_i^3 - t_i}{(n_1 + n_2)(n_1 + n_2 - 1)} \]
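For reference, scipy implements this test as mannwhitneyu; a minimal sketch with invented ordinal scores:

```python
from scipy.stats import mannwhitneyu

# Invented ordinal scores for two independent groups, for illustration only
g1 = [3, 5, 4, 6, 2, 5, 7, 4, 5, 6]
g2 = [2, 3, 1, 4, 3, 2, 5, 3, 2, 4]

# method="asymptotic" uses the Normal approximation with continuity and
# tie corrections, matching the large-sample procedure described above
res = mannwhitneyu(g1, g2, alternative="two-sided", method="asymptotic")
print(res.statistic, res.pvalue)
```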