Univariate Stats


Appendix A: Introduction to Probability and Statistics

    A.1 Introduction

"If you've heard this story before, don't stop me, because I'd like to hear it again." — Groucho Marx

Many questions have probabilistic solutions. For example: are two objects/motifs similar because of some relationship, or is it coincidence? How do you classify similar items? How surprising is the occurrence of an unusual event? How do we describe probabilistic outcomes using models? What do we do when an exact model is not available? Which hypothesis/model is more relevant/accurate/believable?

Before we talk about models, we need to know about discrete and continuous random variables, and the testing of hypotheses. Random experiments must satisfy the following:

1. All possible discrete outcomes must be known in advance.

2. The outcome of a particular trial is not known in advance, and

3. The experiment can be repeated under identical conditions.

A sample space S, the set of all outcomes, is associated with a random experiment.

Example A.1 S for a coin toss is {H, T}, and for a die roll it is {1, 2, 3, 4, 5, 6}. An event A is a subset of S, and we are usually interested in whether A happens or not, e.g. A = odd value on the roll of a die; A = H on tossing a coin.

The random variable X is some function mapping outcomes in the space S to the real line, X : S → ℝ. E.g. with two coin tosses, S = {(H, T), (T, H), (H, H), (T, T)}; if X = number of heads turning up, then X(H, T) = 1, X(H, H) = 2, etc.

Random variable: a real-valued function defined on the outcomes in a sample space. This function could have discrete or continuous values. For example:

1. The sum of two rolls of a die, which ranges from 2 to 12. (The individual rolls are i.i.d.: independent and identically distributed.)

2. The sequence HHTH of coin tosses is not a random variable: it must have a numerical mapping.

A function of a random variable defines another random variable. Measures such as mean and variance can be associated with a random variable. A random variable can be conditioned on an event, or on another random variable. The notion of independence is important: a random variable may be independent of an event or of another random variable. (Random variable = no control over the outcome.)

Statistics refers to drawing inferences about specific phenomena (usually a random phenomenon) on the basis of a limited sample size. We are interested in descriptive statistics: we need measures of location and measures of spread.

Data scales are classified as categorical, ordinal or cardinal. Categorical scale: a label per individual (win/lose etc.). If the label is a name, we refer to the data as nominal data. Ordinal data can be ranked, but no arithmetic transformation is meaningful, e.g. vision (20-20, 20-30), grades etc. Cardinal data is available in interval or ratio scales. Ratio scale: the zero point is known (e.g. a ratio of absolute temperatures is meaningful in K but not in °C; the same holds for mass flow ratios). Interval scale (arbitrary zero): °F, °C. (We mostly use cardinal data.)

Cardinal data may be sorted and ranked: information may be lost in the process, but analysis


may be easier. Consider the following transformation: if 4 individuals A, B, C and D have heights (in inches) 72, 60, 68 and 76 respectively, they may be transformed to ranks: 2, 4, 3 and 1. The height difference between B and A (12 in) is not the same as that between B and C (8 in), but rank differences do not preserve this, and therefore such transformations may lead to incorrect inferences. Ordinal data are protected by monotonic transforms: if x > y, then log(x) > log(y); however |x − y| ≠ |log(x) − log(y)| in general.

    A.2 Discrete random variables

Measures of location

Fig. A.1: Translating variables on a linear scale (yᵢ = xᵢ vs. yᵢ = xᵢ + c).

Mean

The sample mean is x̄ and the population mean is μ:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{A.1}$$

where N = size of the total population and μ is usually unknown. The arithmetic mean can be translated or rescaled.

Translation: if c is added to every xᵢ, i.e. yᵢ = xᵢ + c, then ȳ = x̄ + c (Fig. A.1).

Scaling: if yᵢ = cxᵢ, then ȳ = cx̄.

    Median

For n observations, the sample median is the ((n + 1)/2)th largest observation for odd n, and the average of the (n/2)th and (n/2 + 1)th largest observations for even n.

    Mode

    Most frequently occurring value. Can be problematic if each xi is unique.

Geometric mean

$$\bar{x}_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \quad\text{or, on a log scale,}\quad \bar{x}_G = \operatorname{antilog}\left[\frac{1}{n}\sum_{i=1}^{n}\log x_i\right]$$

Harmonic mean

$$\frac{1}{\bar{x}_H} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}$$

For positive data, x̄_H < x̄_G < x̄.
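The three means are easy to check numerically. Below is a minimal Python sketch (the data vector is hypothetical; NumPy is assumed available) that computes them and verifies the ordering x̄_H < x̄_G < x̄ for positive data.

```python
import numpy as np

x = np.array([72.0, 60.0, 68.0, 76.0])   # hypothetical positive data

arith = x.mean()
geom = np.exp(np.log(x).mean())          # antilog of the mean log
harm = len(x) / np.sum(1.0 / x)

print(arith, geom, harm)                 # arithmetic > geometric > harmonic
assert harm < geom < arith
```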

Display of data

Pareto diagram

Dot diagram

Fig. A.2: The dot diagram.

Given n = {3, 6, 2, 4, 7, 4, 3}, the dot diagram is shown in Figure A.2. This simple plot can be useful when there is a lot of data. From a visual inspection of the plot, 2 could be considered an outlier. Another advantage of these plots is that they can be superimposed.

Frequency Distributions

Fig. A.3: A frequency distribution.

A frequency distribution is shown as a histogram in Figure A.3. Frequencies for each class/category are tallied (sometimes using the # notation). Endpoints can be conveniently represented using bracket notation: in [1, 10), 1 is included and 10 is not; in [10, 19), 10 is included when tallying. Grouping data together loses information (in the form of specific individual values). Cumulative distributions can also be generated from the frequency curves. If the height of a


rectangle in a histogram is equal to the ratio (relative frequency)/(class width), then the total area = 1 and we have a density histogram.

Stem and leaf displays

Stem | Leaves
  1  | 2 7 5
  2  | 9 1 5 4 8
  3  | 4 9 2 4
  4  | 4 8 2
  5  | 3

i.e. the data are 12 17 15; 29 21 25 24 28; 34 39 32 34; 44 48 42; 53.

    Measures of spread

    Range

The range is the difference between the largest and smallest observations in a sample set. It is easy to compute but is sensitive to outliers.

Quantiles/percentiles

The pth percentile is that threshold such that p% of observations are at or below this value. Therefore it is:

the (k + 1)th largest sample point if np/100 is not an integer (k = largest integer less than np/100);
the average of the (np/100)th and (np/100 + 1)th largest values if np/100 is an integer.

For the 10th percentile, p/100 = 0.1. The median = 50th percentile. The layout of quantities which the calculation of percentiles requires gives information about the spread/shape of the distribution. However, the data must be sorted.

Let Q1 = 25th percentile, Q2 = 50th percentile = median, and Q3 = 75th percentile.

Fig. A.4: A box plot.

To find the pth percentile: (1) order the n observations (from small to large), (2) determine np/100, (3) if np/100 is not an integer, round up, (4) if np/100 is an integer = k, average the kth and (k + 1)th values.

The range (max value − min value) was strongly influenced by outliers. The interquartile range = Q3 − Q1 covers the middle half of the data and is less affected by outliers. The boxplot (Fig. A.4) is a graphical depiction of these ranges. A box is drawn from Q1 to Q3, the median is drawn as a line in the box, and outer lines are drawn either up to the outermost points, or to 1.5 × (box width) = 1.5 × (interquartile range).
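A minimal sketch of the percentile recipe above (pure Python; the data are hypothetical). Note that library routines such as numpy.percentile interpolate by default, so small differences from this textbook rule are expected.

```python
def percentile(data, p):
    """pth percentile, textbook rule: sort, compute np/100,
    round up if fractional, else average the k-th and (k+1)-th values."""
    xs = sorted(data)
    n = len(xs)
    pos = n * p / 100.0
    if pos != int(pos):
        return xs[int(pos)]              # ceil(pos)-th value, 1-indexed
    k = int(pos)
    return 0.5 * (xs[k - 1] + xs[k])     # average k-th and (k+1)-th values

data = [2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)                        # 3.5 5.5 7.5
```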

Deviation

The sum of deviations from the mean, $d = \sum_{i=1}^{n}(x_i - \bar{x})/n = 0$, is not useful. The mean deviation is $\sum_{i=1}^{n}|x_i - \bar{x}|/n$. This metric does measure spread, but does not accurately reflect a bell shaped distribution.

Variance and Standard deviation

A variance may be defined as

$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \qquad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \tag{A.2}$$

where N denotes the size of a large population. The sample variance S² uses (n − 1) instead of n in the denominator: there are only (n − 1) independent deviations if x̄ is given, since their sum is constrained to be 0. This definition of S² with (n − 1) is used to better represent the population variance σ². A large S² implies large variability. S² is always ≥ 0.

The definition of S² above requires that x̄ be computed first and then S²: two passes of the data are required. A more convenient form (given below) permits simultaneous computation of $\sum x_i^2$ and $\sum x_i$, and thus requires one pass.

Fig. A.5: Effect of scaling on variance (yᵢ = xᵢ vs. yᵢ = cxᵢ).

$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2/n}{n-1} \tag{A.3}$$

For translated data, yᵢ = xᵢ + c (i = 1 to n): ȳ = x̄ + c and S²ᵧ = S²ₓ. For scaled samples, yᵢ = cxᵢ for i = 1, ..., n with c > 0: S²ᵧ = c²S²ₓ (Fig. A.5).

Coefficient of variation

$$CV = \frac{S}{\bar{x}} \times 100\% \tag{A.4}$$

This metric is dimensionless, hence one can discuss S relative to the mean's magnitude. This is useful when comparing different sets of samples with different means, with larger means usually having higher variability.

Indices of diversity

Such indices are used for nominal scaled data where mean/median do not make sense, e.g. to describe diversity among groups. Uncertainty is a measure of diversity. For k categories and pᵢ = fᵢ/n = proportion of observations in a category,

$$H = -\sum_{i=1}^{k} p_i \log p_i = \frac{n\log n - \sum_{i=1}^{k} f_i \log f_i}{n} \tag{A.5}$$

(More on uncertainty later. For now, notice the connection to entropy!)

H_max = log k. Shannon's index J = H/H_max.
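A minimal sketch of (A.5) for hypothetical category counts fᵢ, verifying that the two forms of H agree:

```python
import math

f = [40, 30, 20, 10]                      # hypothetical category counts
n = sum(f)
p = [fi / n for fi in f]

H = -sum(pi * math.log(pi) for pi in p)   # Shannon uncertainty
H_alt = (n * math.log(n) - sum(fi * math.log(fi) for fi in f)) / n
J = H / math.log(len(f))                  # Shannon's index

print(H, H_alt, J)                        # the two forms of H agree
```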

Probability mass function (probability distribution)

The assignment of a probability to the value r that a random variable X takes is represented as Pr_X(X = r). E.g. for two coin tosses with X = number of heads:

$$Pr_X(x) = \begin{cases} 1/2 & x = 1 \\ 1/4 & x = 0 \text{ or } x = 2 \\ 0 & \text{otherwise} \end{cases}$$

Obviously $0 \le Pr(X = r) \le 1$ and $\sum_{r \in S} Pr(X = r) = 1$. (Notation: we will use P(X = r) instead of Pr_X(X = r).)

A probability distribution is a model based on infinite sampling, identifying the fraction of outcomes/observations in a sample that should be allocated to a specific value. Consider a series of 10 coin tosses where the outcome = 4 heads. A probability distribution can tell us how often 4 heads crop up when the 10 coin toss experiment is repeated 1000 times.

To establish the probability that a drug works on {0, 1, 2, 3, 4} out of 4 patients treated, theoretical calculations (from some distribution) are compared with frequency counts generated from 100 samples.

r | P(X = r) | Frequency distribution (100 samples)
0 | 0.008    | 0.000 [0/100]
1 | 0.076    | 0.090 [9/100]
2 | 0.265    | 0.240 [24/100]
3 | 0.411    | 0.480 [48/100]
4 | 0.240    | 0.190 [19/100]

Compare the two distributions: does clinical practice match expectations from experience? Given data and some understanding of the process, you can apply one of many known PMFs to generate the middle column above.


Expectation of a discrete random variable

The expectation of X is denoted

$$E[X] = \sum_{i=1}^{R} x_i P(x_i) \tag{A.6}$$

where xᵢ = value i of the random variable. E.g. for the probability that a drug works on {0, 1, 2, 3, 4} out of four patients (table below): E[X] = 0(0.008) + 1(0.076) + 2(0.265) + 3(0.411) + 4(0.240) = 2.80. Therefore 2.8 out of 4 patients would be cured, on average.

r | P(X = r)
0 | 0.008
1 | 0.076
2 | 0.265
3 | 0.411
4 | 0.240
Σ | 1.000

For the case of two coin tosses, if X = number of heads, then E[X] = 0(1/4) + 1(1/2) + 2(1/4) = 1. If the probability of a head on a coin toss were 3/4 (a binomial random variable with n = 2, p = 3/4):

$$P(X = k) = \begin{cases} (1/4)^2 = 1/16 & k = 0 \\ 2\cdot(1/4)(3/4) & k = 1 \\ (3/4)^2 & k = 2 \end{cases}$$

$$E[X] = 0\cdot(1/4)^2 + 1\cdot 2(1/4)(3/4) + 2\cdot(3/4)^2 = 24/16 = 1.5$$

Example A.2 (Expectations can be compared.) Consider a game where 2 questions are to be answered. You have to decide which question to answer first. Question 1 may be correctly answered with probability 0.8, in which case you win Rs. 100. Question 2 may be correctly answered with probability 0.5, in which case you win Rs. 200. If the first question is incorrectly answered, the game ends. What should you do to maximize your expected prize money?

Soln:

Notice a tradeoff: asking the more valuable question first runs the risk of not getting to answer the other question. Let the total prize money X be the random variable. If Q1 is asked first, the PMF of X is p_X(0) = 0.2, p_X(100) = 0.8 × 0.5, p_X(300) = 0.8 × 0.5. Therefore E[X] = 0 × 0.2 + 100 × 0.8 × 0.5 + 300 × 0.8 × 0.5 = Rs. 160.

If Q2 is asked first, the PMF of X is p_X(0) = 0.5, p_X(200) = 0.5 × 0.2, p_X(300) = 0.5 × 0.8, hence E[X] = 0 × 0.5 + 200 × 0.5 × 0.2 + 300 × 0.5 × 0.8 = Rs. 140. Hence ask question 1 first.
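The two expectations in Example A.2 can be checked by enumerating outcomes; a minimal sketch:

```python
def expected_prize(p_first, prize_first, p_second, prize_second):
    """E[prize] when the game ends on a wrong first answer."""
    win_both = p_first * p_second * (prize_first + prize_second)
    win_first_only = p_first * (1 - p_second) * prize_first
    return win_both + win_first_only

print(expected_prize(0.8, 100, 0.5, 200))   # Q1 first: 160.0
print(expected_prize(0.5, 200, 0.8, 100))   # Q2 first: 140.0
```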

Example A.3 (Expectation comparisons can be misleading!) The weather is good with probability 0.6. You walk 2 km to class at a speed V = 5 kph, or you drive 2 km to class at a speed V = 30 kph. If the weather is good, you walk. What is the mean of the time T to get to class? What is the mean of the velocity V?

Soln:

PMF of T: P_T(t) = 0.6 if t = 2/5 hr, 0.4 if t = 2/30 hr.

Therefore E[T] = 0.6 × 2/5 + 0.4 × 2/30 = 4/15 hr, and E[V] = 0.6 × 5 + 0.4 × 30 = 15 kph. Note that T = 2/V, but E[T] = E[2/V] ≠ 2/E[V]: E[T] = 4/15, while 2 km/E[V] = 2/15.

Variance of a discrete random variable

The variance of a discrete random variable is defined as

$$Var[X] = \sigma_X^2 = \sum_{i=1}^{R}(x_i - \mu)^2 P(X = x_i) \tag{A.7}$$

The standard deviation $SD(X) = \sqrt{Var[X]} = \sigma$ and has the same units as the mean (and X). Using the E[·] notation,

$$Var[X] = \sigma_X^2 = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2 = E[X^2] - \mu^2 \tag{A.8}$$


Moments

$\mu_k'$ denotes the kth moment about the origin:

$$\mu_k' = \sum_i x_i^k P_X(x_i) \tag{A.9}$$

Hence the mean $\mu = E[X] = \sum_i x_i P(x_i)$ = 1st moment about the origin. $\mu_k$ is the kth moment about the mean:

$$\mu_k = \sum_i (x_i - \mu)^k P_X(x_i) \tag{A.10}$$

Variance = 2nd moment about the mean = $\mu_2$, and $\sigma^2 = \mu_2' - \mu^2$. The skewness (degree of asymmetry) is $\mu_3/\sigma^3$:

$$\frac{\mu_3}{\sigma^3} = \frac{E\left[(X - E[X])^3\right]}{\sigma^3} \tag{A.11}$$

Skew to the right (+ve skew) or left (−ve skew). The mean is on the same side of the mode as the longer tail. An alternate measure of skew is (mean − mode)/σ; to avoid using the mode, use skewness ≈ 3(mean − median)/σ.

Kurtosis (peakedness) is

$$\frac{\mu_4}{\sigma^4} = \frac{E\left[(X - E[X])^4\right]}{\sigma^4} \tag{A.12}$$

A moment generating function of X is defined as

$$M(t) = M_X(t) = E\left[e^{tX}\right] = E\left[1 + tX + \frac{t^2X^2}{2!} + \ldots\right] \tag{A.13}$$

(If t is replaced with −s you have the Laplace transform, and if replaced with it, the Fourier transform. It's a small world...)

Then $\mu_n' = E[X^n]$ is the nth moment of X and is equal to $M^{(n)}(0)$, the nth derivative of M evaluated at t = 0 (from a comparison with a Taylor expansion of M(t)):

$$M(t) = 1 + \mu_1' t + \frac{\mu_2' t^2}{2!} + \ldots + \frac{\mu_n' t^n}{n!} + \ldots$$

If $y = \sum_{i=1}^{n} x_i$ with the $x_i$ independent (like a binomial — several coin tosses — from several i.i.d. Bernoullis), then

$$M_y(t) = E\left[e^{ty}\right] = E\left[e^{tx_1}e^{tx_2}\cdots e^{tx_n}\right] = E\left[e^{tx_1}\right]E\left[e^{tx_2}\right]\cdots E\left[e^{tx_n}\right] = \prod_{i=1}^{n} M_{x_i}(t) \tag{A.14}$$

Chebyshev's theorem

If a probability distribution has mean μ and standard deviation σ, the probability of getting a value deviating from μ by at least kσ is at most 1/k². We can talk in terms of kσ, for example 2σ, 3σ etc.

$$P(|X - E[X]| \ge k\sigma) \le \frac{1}{k^2} \tag{A.15}$$

This holds for any probability distribution!

Fig. A.6: Proof of Chebyshev's theorem (regions R1, R2, R3).

Proof:

$$\sigma_X^2 = \sum_{i=1}^{R}(x_i - \mu)^2 P(X = x_i)$$

Consider three regions R1, R2 and R3 (see Fig. A.6), where R1: x ≤ μ − kσ, R2: μ − kσ < x < μ + kσ, and R3: x ≥ μ + kσ. Dropping the (nonnegative) contribution of R2, and noting that (xᵢ − μ)² ≥ k²σ² in R1 and R3,

$$\sigma^2 \ge k^2\sigma^2\left[\sum_{R1} P(x_i) + \sum_{R3} P(x_i)\right] = k^2\sigma^2\,P(|X - \mu| \ge k\sigma) \Rightarrow P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

Mean and variance of the geometric random variable: let X = number of trials up to and including the first success, with success probability p per trial, so that P(X = k) = p(1 − p)^(k−1). Condition on the outcome of the first trial: the events {X = 1} and {X > 1}. If X = 1, E[X|X = 1] = 1 (we have success on the first try). If the first try fails (X > 1), one try is wasted, but we are back where we started: the expected number of remaining tries is still E[X], and therefore E[X|X > 1] = 1 + E[X].

$$E[X] = P(X = 1)E[X|X = 1] + P(X > 1)E[X|X > 1] = p + (1 - p)(1 + E[X]) \Rightarrow E[X] = \frac{1}{p} \text{ (on rearranging)}$$

Similarly $E[X^2|X = 1] = 1$, which implies that

$$E[X^2|X > 1] = E\left[(1 + X)^2\right] = 1 + 2E[X] + E[X^2]$$

Therefore,

$$E[X^2] = p \cdot 1 + (1 - p)\left(1 + 2E[X] + E[X^2]\right) = \frac{1 + 2(1 - p)E[X]}{p} = \frac{2}{p^2} - \frac{1}{p}$$

$$Var[X] = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} = \frac{1 - p}{p^2}$$
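A quick simulation check of these conditional-expectation results (NumPy assumed; the value of p is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
samples = rng.geometric(p, size=200_000)   # trials up to first success

print(samples.mean(), 1 / p)               # ~3.33 vs 1/p
print(samples.var(), (1 - p) / p**2)       # ~7.78 vs (1-p)/p^2
```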

Multiple random variables: Joint PMFs

Given two discrete random variables X and Y in the same experiment, the joint PMF of X and Y is $P_{X,Y}(X = x, Y = y)$ for all pairs (x, y) that X and Y can take.

$$P_X(x) = \sum_y P_{X,Y}(x, y) \qquad P_Y(y) = \sum_x P_{X,Y}(x, y) \tag{A.25}$$


$P_X(x)$ and $P_Y(y)$ are called marginal PMFs. If Z = g(X, Y), then

$$P_Z(z) = \sum_{\{(x,y)\,|\,g(x,y)=z\}} P_{X,Y}(x, y) \tag{A.26}$$

$$E[g(X, Y)] = \sum_{x,y} g(x, y)P_{X,Y}(x, y) \tag{A.27}$$

For example, E[aX + bY + c] = aE[X] + bE[Y] + c. For three variables:

$$P_{X,Y}(x, y) = \sum_z P_{X,Y,Z}(x, y, z) \tag{A.28}$$

$$P_X(x) = \sum_y \sum_z P_{X,Y,Z}(x, y, z) \tag{A.29}$$

$$E[g(X, Y, Z)] = \sum_{x,y,z} g(x, y, z)P_{X,Y,Z}(x, y, z) \tag{A.30}$$

For n random variables X1, X2, ..., Xn and scalars a1, a2, ..., an (we saw this before: a binomial is constructed from several Bernoulli variables),

$$E[a_1X_1 + a_2X_2 + \ldots + a_nX_n] = a_1E[X_1] + a_2E[X_2] + \ldots + a_nE[X_n]$$

For example, E[aX + bY + cZ + d] = aE[X] + bE[Y] + cE[Z] + d.

Conditional PMFs

Given that event A has occurred, we can use conditional probability: this implies a new sample space where A is known to have happened. The conditional PMF of X, given that event A occurred with P(A) > 0, is

$$P_{X|A}(x) = P(X = x|A) = \frac{P(\{X = x\} \cap A)}{P(A)} \tag{A.31}$$

$$\sum_x P(\{X = x\} \cap A) = P(A) \tag{A.32}$$

Events $\{X = x\} \cap A$ are disjoint for different values of x, and hence $\sum_x P_{X|A}(x) = 1$: $P_{X|A}$ is a PMF.

Example A.5 X = roll of a die and A = event that the roll is an even number.

$$P_{X|A}(x) = P(X = x\,|\,\text{roll is even}) = \frac{P(X = x \text{ and } X \text{ is even})}{P(\text{roll is even})} = \begin{cases} 1/3 & x = 2, 4, 6 \\ 0 & \text{otherwise} \end{cases}$$

The conditional PMF of X given random variable Y = y is used when one random variable is conditioned on another:

$$P_{X|Y}(x|y) = P(X = x|Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P_{X,Y}(x, y)}{P_Y(y)}$$

Note that $\sum_x P_{X|Y}(x|y) = 1$ (normalization). Since $P_{X,Y}(x, y) = P_Y(y)P_{X|Y}(x|y) = P_X(x)P_{Y|X}(y|x)$, we can compute joint PMFs from conditionals using a sequential approach.

Example A.6 Consider four rolls of a six-sided die. X = number of 1s and Y = number of 2s obtained. What is $P_{X,Y}(x, y)$?

Soln:

The marginal PMF $P_Y$ is a binomial:

$$P_Y(y) = {}^4C_y\,(1/6)^y(5/6)^{4-y}, \quad y = 0, 1, 2, 3, 4$$

To get the conditional PMF $P_{X|Y}(x|y)$: given that Y = y, X = number of 1s in the 4 − y remaining rolls, in which the values 1, 3, 4, 5, 6 occur with equal probability. Therefore

$$P_{X|Y}(x|y) = {}^{4-y}C_x\,(1/5)^x(4/5)^{4-y-x} \quad \text{for all } x, y = 0, 1, 2, 3, 4 \text{ with } 0 \le x + y \le 4$$

For x, y such that 0 ≤ x + y ≤ 4,

$$P_{X,Y}(x, y) = P_Y(y)P_{X|Y}(x|y) = {}^4C_y\,(1/6)^y(5/6)^{4-y}\cdot{}^{4-y}C_x\,(1/5)^x(4/5)^{4-y-x}$$

For other (x, y), $P_{X,Y}(x, y) = 0$.

Marginal PMFs can therefore be computed from conditional PMFs (this is the total probability theorem in a different notation):

$$P_X(x) = \sum_y P_{X,Y}(x, y) = \sum_y P_Y(y)P_{X|Y}(x|y)$$

Independence

Random variable X is independent of event A if $P(X = x \text{ and } A) = P(X = x)P(A) = P_X(x)P(A)$ for all x. As long as P(A) > 0, since $P_{X|A}(x) = P(X = x \text{ and } A)/P(A)$, this implies that $P_{X|A}(x) = P_X(x)$ for all x.

Random variables X and Y are independent if (the value of Y (= y) has no bearing on the value of X (= x))

$$P_{X,Y}(x, y) = P_X(x)P_Y(y) \tag{A.33}$$

for all x, y; equivalently, $P_{X|Y}(x|y) = P_X(x)$ for all y with $P_Y(y) > 0$, and all x.

Random variables X and Y are conditionally independent given an event A with P(A) > 0 if

$$P(X = x, Y = y|A) = P(X = x|A)P(Y = y|A) \tag{A.34}$$

for all x and y; equivalently, $P_{X|Y,A}(x|y) = P_{X|A}(x)$ for all x and y such that $P_{Y|A}(y) > 0$.

For independent variables (independence implies E[XY] = E[X]E[Y]):

$$E[XY] = \sum_x \sum_y xy\,P_{X,Y}(x, y) = \sum_x \sum_y xy\,P_X(x)P_Y(y) = \left[\sum_x xP_X(x)\right]\left[\sum_y yP_Y(y)\right] = E[X]\,E[Y] \tag{A.35}$$

Similarly,

$$E[g(X)h(Y)] = E[g(X)]\,E[h(Y)] \tag{A.36}$$

Expectations of combined variables always add up:

$$E[X + Y] = E[X] + E[Y] \tag{A.37}$$

Variances add up only if X and Y are independent. For two independent random variables X and Y, with Z = X + Y:

$$Var[Z] = E\left[(X + Y - E[X + Y])^2\right] = E\left[\left((X - E[X]) + (Y - E[Y])\right)^2\right]$$
$$= E\left[(X - E[X])^2\right] + E\left[(Y - E[Y])^2\right] + 2E[(X - E[X])(Y - E[Y])]$$
$$= E\left[(X - E[X])^2\right] + E\left[(Y - E[Y])^2\right] = Var[X] + Var[Y] \tag{A.38}$$

(the cross term vanishes since, by independence, it factors into E[X − E[X]]·E[Y − E[Y]] = 0).


Independence for several random variables

Independence ⇒ $P_{X,Y,Z}(x, y, z) = P_X(x)P_Y(y)P_Z(z)$ for all x, y, z, etc. Obviously g(X, Y) and h(Z) are then independent. For n independent variables X1, X2, ..., Xn,

$$Var[X_1 + X_2 + \ldots + X_n] = Var[X_1] + Var[X_2] + \ldots + Var[X_n] \tag{A.39}$$

Example A.7 Variance of the binomial: given n independent coin tosses, with p = probability of heads, for each i let Xi = Bernoulli random variable (= 1 if heads, = 0 if tails) and X = X1 + X2 + ... + Xn. Then X is a binomial random variable. The coin tosses are independent ⇒ X1, X2, ..., Xn are independent variables and

$$Var[X] = \sum_{i=1}^{n} Var[X_i] = np(1-p)$$

Covariance and Correlation

The covariance of random variables X and Y is

$$cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y] \tag{A.40}$$

When cov[X, Y] = 0, X and Y are uncorrelated. If X and Y are independent, they are also uncorrelated:

$$E[XY] = E[X]\,E[Y] \Rightarrow cov[X, Y] = 0$$

The reverse is not true! Beware: uncorrelated variables need not be independent.

Example A.8 Let (X, Y) take values (1, 0), (0, 1), (−1, 0), (0, −1), each with probability 1/4, i.e. the marginal probabilities of X and Y are symmetric around 0. So E[X] = E[Y] = 0. For all possible values of (x, y), either x or y = 0 ⇒ XY = 0 ⇒ E[XY] = 0 ⇒ cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] = 0 ⇒ X and Y are uncorrelated. However, X and Y are not independent: a nonzero X implies Y = 0, etc.

Correlation Coefficient of X and Y

Let X and Y be random variables with nonzero variances. The correlation coefficient is a normalized version of covariance (note that −1 ≤ ρ_XY ≤ 1):

$$\rho_{XY} = \frac{cov[X, Y]}{\sqrt{Var[X]\,Var[Y]}} \tag{A.41}$$

Example A.9 Consider n independent tosses of a biased coin (P(head) = p). Let X = # of heads and Y = # of tails. Are X, Y correlated?

Soln:

For all (x, y), x + y = n ⇒ E[X] + E[Y] = n ⇒ X − E[X] = −(Y − E[Y]), so

$$cov[X, Y] = E[(X - E[X])(Y - E[Y])] = -E\left[(X - E[X])^2\right] = -Var[X]$$

Note that Var[Y] = Var[n − X] = Var[−X] = Var[X].

$$\rho(X, Y) = \frac{cov[X, Y]}{\sqrt{Var[X]\,Var[Y]}} = \frac{-Var[X]}{\sqrt{Var[X]\,Var[X]}} = -1$$


Correlation

For the sum of several, not necessarily independent variables (the general case):

$$Var\left[\sum_{i=1}^{n} c_iX_i\right] = \sum_{i=1}^{n} c_i^2\,Var[X_i] + 2\sum_{i<j} c_ic_j\,cov[X_i, X_j] \tag{A.42}$$

$$Var[c_1X_1 + c_2X_2] = c_1^2Var[X_1] + c_2^2Var[X_2] + 2c_1c_2\,cov[X_1, X_2]$$
$$= c_1^2Var[X_1] + c_2^2Var[X_2] + 2c_1c_2\sqrt{Var[X_1]\,Var[X_2]}\,\rho(X_1, X_2)$$

A.3 Continuous probability distributions

Continuous vs discrete: continuous distributions are more fine grained, may be more accurate, and permit analysis by calculus. A random variable X is continuous if its probability distribution can be defined by a non-negative function $f_X$ such that

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx \tag{A.43}$$

i.e. P(X is between a and b) = area under the curve from a to b. The total area under the curve = 1. $f_X(x)$ is a probability density function (PDF). Intuitively, the PDF of X is large where probability is high and small where it is low. For a single value of X, $P(X = a) = \int_a^a f_X(x)\,dx = 0$, so including endpoints does not matter:

$$P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b)$$

For $f_X$ to be a valid PDF, it must satisfy ($f_X(x)$ is not the probability of an event: it can even be greater than one):

1. $f_X(x) \ge 0$ for all x,
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = P(-\infty < X < \infty) = 1$.

$P([x, x + \delta x]) = \int_x^{x+\delta x} f_X(t)\,dt \approx f_X(x)\,\delta x$, where $f_X(x)$ = probability mass per unit length near x.

Expectation and Moments for continuous RVs

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx \tag{A.44}$$

The nth moment is $E[X^n]$. In general $E[g(X)] = \int g(x)f_X(x)\,dx$; g(x) may be continuous or discrete. The variance is

$$Var[X] = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2 = \int_{-\infty}^{\infty}(x - E[X])^2 f_X(x)\,dx > 0$$

If Y = aX + b, then E[Y] = aE[X] + b and Var[Y] = a²Var[X].

(Discrete uniform law: the sample space had a finite number of equally likely outcomes. For discrete variables, we count the number of outcomes associated with an event. For continuous variables, we compute the length of a subset of the real line.)

Continuous uniform random variable

The PDF of a continuous uniform random variable is

$$f_X(x) = \begin{cases} c & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \tag{A.45}$$

where c must be > 0. For f to be a PDF, $1 = \int_a^b c\,dz = c\int_a^b dz = c(b-a) \Rightarrow c = 1/(b-a)$.


Piecewise constant PDF, general form:

$$f_X(x) = \begin{cases} c_i & a_i \le x < a_{i+1},\ i = 1, 2, \ldots, n-1 \\ 0 & \text{otherwise} \end{cases}$$

$$1 = \int_{a_1}^{a_n} f_X(x)\,dx = \sum_{i=1}^{n-1}\int_{a_i}^{a_{i+1}} c_i\,dx = \sum_{i=1}^{n-1} c_i(a_{i+1} - a_i)$$

For example, the following (unbounded) density is a valid PDF:

$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{x}} & 0 < x \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad \int_0^1 \frac{1}{2\sqrt{x}}\,dx = \left.\sqrt{x}\right|_0^1 = 1$$

Fig. A.11: The piecewise constant distribution.

Mean and variance of the uniform random variable:

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a}\left.\frac{x^2}{2}\right|_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}$$

The PDF is symmetric around (a + b)/2.

$$E[X^2] = \int_a^b \frac{x^2}{b-a}\,dx = \frac{1}{b-a}\left.\frac{x^3}{3}\right|_a^b = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}$$

$$Var[X] = E[X^2] - (E[X])^2 = \frac{a^2 + ab + b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}$$

Example A.10 Let [a, b] = [0, 1] (X uniform) and let

$$g(x) = \begin{cases} 1 & x \le 1/3 \\ 2 & x > 1/3 \end{cases}$$

1. Discrete: Y = g(X) has PMF

$$P_Y(1) = P(X \le 1/3) = 1/3 \qquad P_Y(2) = P(X > 1/3) = 2/3$$

$$E[Y] = \frac{1}{3}\cdot 1 + \frac{2}{3}\cdot 2 = \frac{5}{3}$$

2. Continuous:

$$E[Y] = \int_0^1 g(x)f_X(x)\,dx = \int_0^{1/3} 1\,dx + \int_{1/3}^{1} 2\,dx = \frac{1}{3} + \frac{4}{3} = \frac{5}{3}$$

Exponential continuous distribution

The PDF of X is

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{A.46}$$

Is it a PDF?

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = \left.-e^{-\lambda x}\right|_0^{\infty} = 1$$

For any a ≥ 0,

$$P(X \ge a) = \int_a^{\infty} \lambda e^{-\lambda x}\,dx = \left.-e^{-\lambda x}\right|_a^{\infty} = e^{-\lambda a}$$

Fig. A.12: The exponential distribution with λ = 0.05.

Examples: time till a light bulb burns out; time till an accident.


Mean

$$E[X] = \int_0^{\infty} x(\lambda e^{-\lambda x})\,dx = \left.(-xe^{-\lambda x})\right|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 - \left.\frac{e^{-\lambda x}}{\lambda}\right|_0^{\infty} = \frac{1}{\lambda}$$

Variance

$$E[X^2] = \int_0^{\infty} x^2(\lambda e^{-\lambda x})\,dx = \left.(-x^2e^{-\lambda x})\right|_0^{\infty} + \int_0^{\infty} 2xe^{-\lambda x}\,dx = 0 + \frac{2}{\lambda}E[X] = \frac{2}{\lambda^2}$$

$$Var[X] = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$

Example A.11 The sky is falling... Meteorites land in a certain area at an average of one every 10 days. What is the probability that the first meteorite lands between 6 AM and 6 PM of the first day, given that it is now midnight?

Soln:

Let X = elapsed time till the strike (in days), an exponential variable with mean = 1/λ = 10 days ⇒ λ = 1/10. 6 AM and 6 PM of the first day correspond to X = 1/4 and X = 3/4:

$$P\left(\frac{1}{4} \le X \le \frac{3}{4}\right) = P\left(X \ge \frac{1}{4}\right) - P\left(X \ge \frac{3}{4}\right) = e^{-\lambda/4} - e^{-3\lambda/4} = e^{-1/40} - e^{-3/40} = 0.047$$

Probability that the meteorite lands between 6 AM and 6 PM of some day: for the kth day, this time frame is (k − 3/4) ≤ X ≤ (k − 1/4), so the probability is

$$\sum_{k=1}^{\infty} P\left(k - \frac{3}{4} \le X \le k - \frac{1}{4}\right) = \sum_{k=1}^{\infty}\left[P\left(X \ge k - \frac{3}{4}\right) - P\left(X \ge k - \frac{1}{4}\right)\right] = \sum_{k=1}^{\infty}\left(e^{-(4k-3)/40} - e^{-(4k-1)/40}\right)$$
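A minimal numerical check of Example A.11 (pure Python; the infinite sum is truncated):

```python
import math

lam = 1 / 10                                      # rate, per day

def p_ge(a):                                      # P(X >= a), exponential
    return math.exp(-lam * a)

print(p_ge(0.25) - p_ge(0.75))                    # first day: ~0.047

total = sum(p_ge(k - 0.75) - p_ge(k - 0.25) for k in range(1, 200))
print(total)                                      # any day, summed over k
```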

Cumulative distribution function (CDF)

The CDF of X is

$$F_X(x) = P(X \le x) = \begin{cases} \sum_{k \le x} P_X(k) & X \text{ discrete} \\ \int_{-\infty}^{x} f_X(t)\,dt & X \text{ continuous} \end{cases} \tag{A.47}$$

For the discrete form, the CDF will look like a staircase.

Some properties of CDFs $F_X(x)$:

$F_X(x)$ is monotonically nondecreasing: if x ≤ y, then $F_X(x) \le F_X(y)$. $F_X(x) \to 0$ as $x \to -\infty$, and $F_X(x) \to 1$ as $x \to +\infty$. If X is discrete, $F_X$ is piecewise constant (a staircase); if X is continuous, $F_X$ is continuous.

If X is discrete, CDF from PMF:

$$F_X(k) = \sum_{i=-\infty}^{k} P_X(i)$$

PMF from CDF (for all k):

$$P_X(k) = P(X \le k) - P(X \le k-1) = F_X(k) - F_X(k-1)$$


If X is continuous, CDF from PDF:

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

PDF from CDF:

$$f_X(x) = \frac{dF_X}{dx}$$

Geometric vs. Exponential CDFs

Let X = geometric random variable with parameter p, e.g. X = number of trials up to and including the first success, with probability of success p per trial:

$$P(X = k) = p(1-p)^{k-1}, \quad k = 1, 2, \ldots$$

$$F_X^{geo}(n) = \sum_{k=1}^{n} p(1-p)^{k-1} = p\,\frac{1 - (1-p)^n}{1 - (1-p)} = 1 - (1-p)^n, \quad n = 1, 2, \ldots$$

Let Y = exponential random variable, λ > 0:

$$F_Y^{exp}(x) = \int_0^x \lambda e^{-\lambda t}\,dt = \left.-e^{-\lambda t}\right|_0^x = 1 - e^{-\lambda x}, \quad x > 0$$

Fig. A.13: Geometric vs. exponential distributions (p = 0.15 and λ = 0.185).

Comparison: let $\lambda = -\ln(1-p)$, i.e. $e^{-\lambda} = 1 - p$. The exponential and geometric CDFs then match at all x = n, n = 1, 2, ...:

$$F^{exp}(n) = F^{geo}(n), \quad n = 1, 2, \ldots$$
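The CDF match can be verified numerically; a minimal sketch with p = 0.15 and λ = −ln(1 − p):

```python
import math

p = 0.15
lam = -math.log(1 - p)

for n in range(1, 6):
    f_geo = 1 - (1 - p) ** n
    f_exp = 1 - math.exp(-lam * n)
    print(n, f_geo, f_exp)        # identical at every integer n
```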

Normal (Gaussian) random variables

The most useful distribution: we use it to approximate other distributions.

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty \tag{A.48}$$

for some parameters μ and σ (σ > 0).

Normalization confirms that it is a PDF:

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-(x-\mu)^2/2\sigma^2}\,dx = 1$$

Fig. A.14: Normal distribution.

E[X] = mean = μ (from symmetry arguments alone).

$$Var[X] = \int_{-\infty}^{\infty}(x - E[X])^2 f_X(x)\,dx = \int_{-\infty}^{\infty}(x-\mu)^2\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}\,dx$$

Let z = (x − μ)/σ:

$$Var[X] = \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^2e^{-z^2/2}\,dz = \frac{\sigma^2}{\sqrt{2\pi}}\left.(-ze^{-z^2/2})\right|_{-\infty}^{+\infty} + \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \sigma^2$$

X = normal variable with (μ, σ²) = (mean, var). The peak height of N(μ, σ) is $1/\sqrt{2\pi}\sigma$, hence height ∝ 1/σ. If Y = aX + b, then E[Y] = aE[X] + b and Var[Y] = a²σ².


Standard Normal Distribution

A normal distribution Y with mean μ = 0 and σ² = 1: N(μ, σ²) = N(0, 1) = the standard normal. The CDF of Y is

$$\Phi(y) = P(Y \le y) = P(Y < y) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-t^2/2}\,dt \tag{A.49}$$

Φ(y) is available as a table. If the table only gives the value of Φ(y) for y ≥ 0, use symmetry.

Example A.12 Φ(−0.5) = P(Y ≤ −0.5) = P(Y ≥ 0.5) = 1 − P(Y ≤ 0.5) = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085.

Given a normal random variable X with mean μ and variance σ², use

$$Z = \frac{X - \mu}{\sigma}$$

Then

$$E[Z] = \frac{E[X] - \mu}{\sigma} = 0 \qquad Var[Z] = \frac{Var[X]}{\sigma^2} = 1$$

so Z is standard normal.

Example A.13 Annual rainfall at a spot follows a normal distribution with mean μ = 60 mm/yr and σ = 20 mm/yr. What is the probability that we get at least 80 mm/yr?

Soln: Let X = rainfall/yr. Then Z = (X − μ)/σ = (X − 60)/20. So

$$P(X \ge 80) = P\left(Z \ge \frac{80 - 60}{20}\right) = P(Z \ge 1) = 1 - \Phi(1) = 1 - 0.8413 = 0.1587$$

In general,

$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$

where Z is standard normal. The 100·uth percentile of N(0, 1) is $z_u$, with $P(X < z_u) = u$ where X ∼ N(0, 1).
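Example A.13 in code, using the standard normal CDF (SciPy assumed; math.erf would also do):

```python
from scipy.stats import norm

mu, sigma = 60.0, 20.0
print(1 - norm.cdf(80, loc=mu, scale=sigma))    # 0.1587
print(1 - norm.cdf((80 - mu) / sigma))          # same, via standardization
```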

Approximating a binomial with a normal

Binomial: probability of k successes in n trials, each trial having success probability p. The binomial PMF is painful to handle for large n, so we approximate. E[X] = np and Var[X] = npq, so the equivalent normal distribution is N(np, npq).

Consider a binomial with mean np and variance npq, with a range of interest a ≤ X ≤ b. It may be approximated with a normal N(np, npq) from a − 1/2 ≤ X ≤ b + 1/2. For the special case a = b, P(X = a) = area under the normal curve from a − 1/2 to a + 1/2. At the extremes:

Binomial  | Normal
P(X = 0)  | area to the left of 1/2
P(X = n)  | area to the right of n − 1/2

Use when npq ≥ 5, with n, p, q all moderate.

Approximating the Poisson with a normal

The Poisson is difficult to use for large λ. $P_X(X = x) \approx N(\lambda, \lambda)$, since for the Poisson, mean = var = λ. For the normal, we take the area from x − 1/2 to x + 1/2, or the area to the left of 1/2 for x = 0. Use this approximation for λ ≥ 10.

Summary of approximations

1. Approximating a Binomial with a Normal. Rule of thumb: npq ≥ 5, with n, p, q all moderately valued.
2. Approximating a Poisson with a Normal: λ ≥ 10.
3. Approximating a Binomial with a Poisson: n ≥ 100, p ≤ 0.01. (Obviously, both Poisson and Normal may be used for n ≥ 1000, p = 0.01.)


Beta distribution

The Beta distribution has as its pdf (unlike the binomial, a and b need not be integers):

$$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,x^{a-1}(1 - x)^{b-1}, \quad 0 \le x \le 1 \tag{A.50}$$

Γ(a + 1) = aΓ(a) in general, and when a is an integer, using this relationship recursively gives us Γ(a + 1) = a!. When a = b = 1, Beta(x; 1, 1) = the uniform distribution. It is important to appreciate that by varying a and b, the pdf can attain various shapes. This is particularly useful when we look for a pdf of an appropriate shape.

$$E[X] = \int_0^1 x f(x)\,dx = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1 x^a(1-x)^{b-1}\,dx$$
$$= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\int_0^1 \frac{\Gamma(a+b+1)}{\Gamma(a+1)\Gamma(b)}\,x^a(1-x)^{b-1}\,dx = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)} = \frac{a}{a+b}$$

(the remaining integral is that of a Beta(a + 1, b) density and equals 1). Similarly,

$$E[X^2] = \frac{a(a+1)}{(a+b+1)(a+b)} \qquad Var[X] = \frac{ab}{(a+b)^2(a+b+1)}$$
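A quick check of the Beta moments (SciPy assumed; a and b need not be integers):

```python
from scipy.stats import beta

a, b = 2.5, 4.0
mean, var = beta.stats(a, b, moments='mv')
print(mean, a / (a + b))                            # a/(a+b)
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))    # ab/((a+b)^2 (a+b+1))
```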


Appendix B: Estimation and Hypothesis Testing

    B.1 Samples and statistics

"There's no sense in being precise when you don't even know what you're talking about." — John von Neumann

We sample to draw some conclusions about an entire population. Examples: exit polls, census.

We assume that the population can be described by a probability distribution. We assume that the samples give us measurements following this distribution. We assume that the samples are randomly chosen; it is very difficult to verify that a sample is really a random sample. The population is assumed to be (effectively) infinite; sampling theory must be used if the population is actually finite.

Simple random sample: each member of a random sample is chosen independently and has the same (nonzero) probability of being chosen.

Cluster sampling: if the entire population cannot be uniformly sampled, divide it into clusters and sample some clusters.

Clinical Trials and Randomization

Block randomization: one block gets the treatment and another the placebo. Blind trials: patients do not know what they get. Double blind trials: both doctors and patients do not know what they get.

Distributions and Estimation

We wish to infer the parameters of a population distribution: mean, variance, proportion etc. We use a statistic derived from the sample data: sample mean, sample variance etc. A statistic is a random variable. What distribution does it follow? What are its parameters?

Notation

Population parameters are typically in Greek: μ = population mean, σ² = population variance. The corresponding statistic is not in Greek: x̄ = sample mean, s² = sample variance. We can also denote an estimate of the parameter θ as θ̂.

Regression: you want y = αx + β; you get y = ax + b.


Why do we infer using the sample mean?

Consider a population with mean μ and variance σ². We have n measurements x1, x2, ..., xn.

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} \tag{B.1}$$

It's all about expectations:

$$E[x_i] = \mu \qquad Var[x_i] = \sigma^2 \tag{B.2}$$

$$E[\bar{x}] = E\left[\frac{x_1 + \cdots + x_n}{n}\right] = \frac{E[x_1] + \cdots + E[x_n]}{n} = \frac{n\mu}{n} = \mu \tag{B.3}$$

$$E[x_i] = E[\bar{x}] = \mu \tag{B.4}$$

x̄ is an unbiased estimator of μ. For a statistic θ̂ to be unbiased,

$$E[\hat{\theta}] = \theta \tag{B.5}$$

The population mean has other unbiased estimators such as the sample median, (min + max)/2, and the trimmed mean. x̄ is not a robust estimator, as it is sensitive to outliers. Nevertheless, x̄ is the preferred estimator of μ: it is the estimator with the smallest variance.

What is the variance of x̄?

We had Var[xᵢ] = σ².

$$Var[\bar{x}] = Var\left[\frac{x_1 + \cdots + x_n}{n}\right] = \frac{1}{n^2}Var[x_1 + \cdots + x_n] = \frac{1}{n^2}\left(Var[x_1] + \cdots + Var[x_n]\right) \text{ (independence!)} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \tag{B.6}$$

xᵢ or x̄?

$$E[x_i] = \mu \qquad Var[x_i] = \sigma^2 \tag{B.7}$$

$$E[\bar{x}] = \mu \qquad Var[\bar{x}] = \frac{\sigma^2}{n} \tag{B.8}$$

If x ∼ N(0, 1), then x̄ ∼ N(0, 1/n). The spread of the sample mean shrinks with n. Therefore we must sample more and gain accuracy!

Fig. B.1: The distribution of x̄. (Intuitively, we desire replication of results.)
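A minimal simulation of (B.6): the variance of the sample mean shrinks as σ²/n (NumPy assumed; the population here is a hypothetical uniform one, deliberately non-normal):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1 / 12                                 # variance of Uniform(0, 1)

for n in (5, 20, 80):
    xbars = rng.uniform(0, 1, size=(50_000, n)).mean(axis=1)
    print(n, xbars.var(), sigma2 / n)           # empirical vs sigma^2/n
```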

The population size can matter

For an infinite population, Var[x̄] = σ²/n. For a finite population of size N, we need a correction factor (more on this in a separate handout):

$$Var[\bar{x}] = \frac{\sigma^2}{n}\cdot\frac{N - n}{N - 1} \tag{B.9}$$


The Central Limit Theorem

Theorem B.1 The sum of a large number of independent random variables has a distribution that is approximately normal.

Let x1, ..., xn be i.i.d. with mean μ and variance σ². For large n,

$$(x_1 + \cdots + x_n) \sim N(n\mu, n\sigma^2) \tag{B.10}$$

$$P\left(\frac{(x_1 + \cdots + x_n) - n\mu}{\sigma\sqrt{n}} < x\right) \to P(z < x) \tag{B.11}$$

Consider the total obtained on rolling several (n) dice (Fig. B.2). As n increases, the distribution approaches a Gaussian curve!

Fig. B.2: Rolling n dice.
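The dice experiment of Fig. B.2 as a sketch (NumPy assumed): the standardized total of n dice approaches N(0, 1) as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, var = 3.5, 35 / 12                          # mean and variance of one die

for n in (1, 2, 10, 50):
    totals = rng.integers(1, 7, size=(100_000, n)).sum(axis=1)
    z = (totals - n * mu) / np.sqrt(n * var)
    # fraction within one standard deviation: -> ~0.683 for a Gaussian
    print(n, np.mean(np.abs(z) < 1))
```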

CLT and the Binomial

Let X = x1 + ... + xn, where xᵢ is a Bernoulli variable.

$$E[x_i] = p \qquad Var[x_i] = p(1-p) \tag{B.12}$$

$$E[X] = np \qquad Var[X] = np(1-p) \tag{B.13}$$

The CLT suggests that for large n,

$$\frac{X - np}{\sqrt{np(1-p)}} \sim N(0, 1) \tag{B.14}$$

The binomial can be approximated with a normal if np(1 − p) ≥ 10. For p = 0.5, npq ≥ 10 implies n ≥ 40! In general, the normal approximation may be applied if n ≥ 30. So keep sampling! (Remember: even if an individual measurement does not follow a normal distribution, the measured sample mean does.)

Sample variance

The population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \tag{B.15}$$

The sample variance is

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \tag{B.16}$$

We had E[x̄] = μ. But why n − 1: is E[s²] = σ²? We compute the expectation as follows:

$$(x_i - \bar{x})^2 = (x_i - \mu + \mu - \bar{x})^2 = (x_i - \mu)^2 + (\mu - \bar{x})^2 + 2(x_i - \mu)(\mu - \bar{x})$$

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 + \sum_{i=1}^{n}(\mu - \bar{x})^2 + 2\sum_{i=1}^{n}(x_i - \mu)(\mu - \bar{x}) \tag{B.17}$$

The middle term is

$$\sum_{i=1}^{n}(\mu - \bar{x})^2 = n(\mu - \bar{x})^2$$


The last term is

$$2\sum_{i=1}^{n}(x_i - \mu)(\mu - \bar{x}) = -2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) = -2n(\bar{x} - \mu)^2$$

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 + n(\mu - \bar{x})^2 - 2n(\bar{x} - \mu)^2 = \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$$

$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} - (\bar{x} - \mu)^2$$

Taking the expectation on each side,

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = E\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}\right] - E\left[(\bar{x} - \mu)^2\right] \tag{B.18}$$

But

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}\right] = \sigma^2 \tag{B.19}$$

$$E\left[(\bar{x} - \mu)^2\right] = Var[\bar{x}] = \frac{\sigma^2}{n} \tag{B.20}$$

This implies

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{(n-1)\sigma^2}{n} \tag{B.21}$$

We have a biased estimator! This means that

$$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right] = \sigma^2 \tag{B.22}$$

Therefore we define an unbiased estimator s² as

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \tag{B.23}$$

We lost a degree of freedom. Where? Compare:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

(μ is known exactly in the population formula; x̄ is estimated from the same data, which costs one degree of freedom.)
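A simulation of the bias result (B.21)-(B.22) (NumPy assumed; σ² and n are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n = 4.0, 5

x = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
xbar = x.mean(axis=1, keepdims=True)
ss = ((x - xbar) ** 2).sum(axis=1)

print(np.mean(ss / n), (n - 1) / n * sigma2)    # biased: ~(n-1)/n * sigma^2
print(np.mean(ss / (n - 1)), sigma2)            # unbiased: ~sigma^2
```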

B.2 Point and Interval Estimation

Point estimates

Given $x_i \sim N(\mu, \sigma^2)$, the preferred point estimate of μ is x̄. The preferred point estimate of σ² is s². We know how x̄ behaves:

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \tag{B.24}$$

We do not know how s² behaves yet. Similarly, intuition leads us to the best estimate of the binomial proportion p as p̂ = X/n. The maximum likelihood estimation approach below provides a method for the determination of parameter point estimates.


Optimal estimation approaches

The MLE of a parameter θ is that value of θ that maximizes the likelihood of seeing the observed values, i.e. P(D|θ). Note that instead we could desire to maximize P(θ|D): i.e. we choose the most probable parameter value given the data that we have already seen. This is maximum a posteriori (MAP) estimation. The MAP estimate is typically more difficult to obtain because we need to rewrite it in terms of P(D|θ) using Bayes' law, and p(θ) needs to be specified. Also note that if P(θ) is effectively a constant (i.e. θ is a uniform RV), then MLE = MAP. There are other approaches towards obtaining estimates of parameters, including the confusingly named Bayesian estimation (different from MAP), maximum entropy etc. In general, note that we want estimates with small variances, and hence minimum variance estimates are popular. In the discussion that follows, we almost always use MLE; we will however encounter MAP at least once later.

MLE of the Binomial parameter p:

Let X = random variable describing a coin flip, X = 1 if heads, X = 0 if tails. For one coin flip (as before, the Bernoulli and the Binomial):

$$f(x) = p^x(1-p)^{1-x} \tag{B.25}$$

For n coin flips,

$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} \tag{B.26}$$

The log-likelihood may be written as

$$l(p|x_1, \ldots, x_n) = \sum_{i=1}^{n}\left[x_i \ln p + (1 - x_i)\ln(1-p)\right] \tag{B.27}$$

The MLE of p is p̂, obtained from (the answer here is obvious; MLE does this quite often: it gives us intuitive answers. However, as the next example shows, MLE can be biased)

$$\frac{dl(p)}{dp} = 0 \Rightarrow \sum_i\left[\frac{x_i}{p} - \frac{1 - x_i}{1-p}\right] = 0$$

which can be rearranged to give

$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{B.28}$$

The MLE of the binomial proportion (X heads in n tosses) is

$$\hat{p} = \frac{X}{n} \quad \text{and} \quad E[\hat{p}] = p \tag{B.29}$$
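A sketch confirming that the closed-form MLE (B.28) maximizes the log-likelihood (B.27) numerically (pure Python; the coin-flip data are hypothetical):

```python
import math

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]          # hypothetical data

def loglik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in flips)

p_hat = sum(flips) / len(flips)                 # closed form: 0.7
grid = [i / 1000 for i in range(1, 1000)]
p_num = max(grid, key=loglik)                   # brute-force maximizer
print(p_hat, p_num)                             # they agree
```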

An MLE perspective on x̄ and s², for a Gaussian

Let D = {x1, ..., xn} be a dataset of i.i.d. samples from N(μ, σ²). Then the joint density function is

$$f(D) = \prod_{i=1}^{n}\frac{e^{-(x_i-\mu)^2/2\sigma^2}}{\sqrt{2\pi}\,\sigma} = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right] \tag{B.30}$$

f(D) is the likelihood of seeing the data; l = ln f(D) is the log likelihood of seeing the data. Which estimates of μ and σ² would maximize the likelihood of seeing the data? Set

$$\frac{\partial l}{\partial \mu} = 0 \qquad \frac{\partial l}{\partial \sigma^2} = 0 \tag{B.31}$$

and solve for μ and σ².

$$\frac{\partial l}{\partial \mu} = 0 \Rightarrow \frac{\sum_{i=1}^{n}(x_i - \mu)}{\sigma^2} = 0 \Rightarrow \hat{\mu} = \frac{\sum_i x_i}{n} = \bar{x} \tag{B.32}$$


The MLE of μ is μ̂ = x̄. The MLE of σ², however, is a surprise:

$$\frac{\partial l}{\partial \sigma^2} = 0 = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2$$

Decompose the sum:

$$\frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{2\sigma^4}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
$$= \frac{1}{2\sigma^4}\left[\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 + 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x})\right] = \frac{1}{2\sigma^4}\left[nS^2 + n(\bar{x} - \mu)^2\right]$$

where S² is defined as $S^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2/n$ (the cross term vanishes since $\sum_i(x_i - \bar{x}) = 0$). Thus

$$0 = -\frac{n}{2\sigma^2} + \frac{n}{2\sigma^4}\left[S^2 + (\bar{x} - \mu)^2\right]$$

But μ̂ = x̄ from before, and hence the MLE of σ² is biased:

$$\hat{\sigma}^2_{ML} = S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \tag{B.33}$$

Interval Estimation

We have point estimates of μ and σ². But remember that x̄ and s² are random variables. We could use bounds: for example, μ = x̄ ± δ, where δ should depend on σ².

Chebyshev's inequality for a random variable x

Theorem B.2

$$P(|x_i - \mu| > k\sigma) \le \frac{1}{k^2} \tag{B.34}$$

$$\sigma^2 = E\left[(x - \mu)^2\right] = \sum_{R1,R2,R3}(x_i - E[x])^2 p(x_i)$$
$$\ge \sum_{R1}(x_i - E[x])^2 p(x_i) + \sum_{R3}(x_i - E[x])^2 p(x_i)$$
$$\ge \sum_{R1} k^2\sigma^2 p(x_i) + \sum_{R3} k^2\sigma^2 p(x_i)$$
$$\Rightarrow \frac{1}{k^2} \ge \sum_{R1} p(x_i) + \sum_{R3} p(x_i) = P(|x_i - \mu| > k\sigma)$$

Fig. B.3: Chebyshev's theorem (regions beyond μ − kσ and μ + kσ).

This applies to ANY distribution. For k = 2, Chebyshev's theorem suggests P(|xᵢ − μ| > 2σ) ≤ 0.25. For a Gaussian, P(|xᵢ − μ| > 2σ) ≈ 0.05. Clearly, if the density function is known, we should be using it, instead of Chebyshev's theorem, to determine probabilities.

Fig. B.4: Intervals on a Gaussian: μ ± σ, μ ± 2σ and μ ± 3σ contain 68%, 95% and 99.7% of the probability.


Chebyshev's inequality for the derived variable x̄

$$P\left(|\bar{x} - \mu| > k\frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{k^2} \tag{B.35}$$

Let $k\sigma/\sqrt{n} = \epsilon$ = some tolerance. Then

$$P(|\bar{x} - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \qquad P(|\bar{x} - \mu| < \epsilon) \ge 1 - \frac{\sigma^2}{n\epsilon^2} \tag{B.36}$$

This is the law of large numbers: for a given tolerance, as n → ∞, x̄ → μ.

Which estimate intervals do we want?

Chebyshev's inequality gives us a loose bound. We want better estimates. For a normal distribution:

1. interval estimate of μ with σ² known,
2. interval estimate of μ with σ² unknown,
3. interval estimate of σ² with μ unknown.

For a binomial distribution:

4. interval estimate of p.

1. Normal: interval for μ, σ² known

We have n normally distributed points.

$$x \sim N(\mu, \sigma^2) \qquad \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Recall the z notation for a threshold: $P(z > z_\alpha) = \alpha$.

Fig. B.5: One-sided interval. Fig. B.6: Two-sided intervals.

For a two-sided interval,

$$P\left(-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha \tag{B.37}$$

$$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \tag{B.38}$$

We can rearrange the inequalities in (B.37) to get

$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{B.39}$$

So what does a confidence interval $\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$ mean? The maximum error of the estimate is $|\bar{x} - \mu| < z_{\alpha/2}\sigma/\sqrt{n}$ with probability 1 − α. What does a confidence interval mean to a frequentist? If the experiment is repeated many times, a fraction 1 − α of the intervals so constructed will cover μ (Fig. B.7). The two-sided confidence interval at level 100(1 − α)% is

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{B.40}$$

Fig. B.7: Confidence intervals.

We can also have lower and upper one-sided intervals:

$$\left(\bar{x} - z_\alpha\frac{\sigma}{\sqrt{n}},\ \infty\right) \tag{B.41}$$


$$\left(-\infty,\ \bar{x} + z_\alpha\frac{\sigma}{\sqrt{n}}\right) \tag{B.42}$$

What should the sample size n be, if we desire x̄ to approach μ to within a desired level of confidence (i.e. given α)? Rearrange and solve for n:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{|\bar{x} - \mu|}\right)^2 \tag{B.43}$$

Does the dependency of n on the various terms make sense? You need more samples if

1. you want x̄ to come very close to μ,
2. σ is large,
3. z_{α/2} is increased (i.e. α is decreased).

Fig. B.8: Intervals on the Gaussian (error E = x̄ − μ; u = x̄ + z_{α/2}σ/√n, l = x̄ − z_{α/2}σ/√n). Fig. B.9: Effect of σ.
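A minimal sketch of (B.39) and (B.43) (SciPy assumed; the numbers are hypothetical):

```python
from scipy.stats import norm
import math

xbar, sigma, n, alpha = 52.3, 4.0, 25, 0.05
z = norm.ppf(1 - alpha / 2)                     # 1.96 for alpha = 0.05

half = z * sigma / math.sqrt(n)
print(xbar - half, xbar + half)                 # 95% two-sided interval for mu

E = 1.0                                         # desired max error |xbar - mu|
print(math.ceil((z * sigma / E) ** 2))          # required n, Eq. (B.43)
```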

2. Normal: interval for μ, σ² unknown

The two-sided interval for μ, when σ² is known, is $\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$. What happens when σ² is unavailable? We must estimate σ²; its point estimate is s². However, there is a problem w.r.t. interval estimation:

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \text{ follows the normal } N(0, 1), \quad \text{but} \quad \frac{\bar{x} - \mu}{s/\sqrt{n}} \text{ does not follow } N(0, 1)$$

In fact, we need a new sampling distribution to describe how $(\bar{x} - \mu)/(s/\sqrt{n})$ varies. We need to acknowledge that our estimate s² is a random variable (RV), with its own distribution centered around mean σ². Therefore we have a range of values for s² which could be used to substitute for σ² in N(μ, σ²), and therefore we must average across several Gaussians which have the same mean but different variances. The new distribution must have the following properties:

The Gaussians are symmetric about μ: therefore a weighted average of Gaussians must be symmetric about μ. Tail: it should have a thick tail for small n (greater variability than N). As n → ∞, we should be back to N(μ, σ²). The distribution is a function of sample size n: we have a family of curves!

(William Gosset, a chemist at the Guinness breweries in Dublin, came up with the Student's t-distribution!)

We know of N(x|μ, σ²). But σ² is unknown, and if we estimate it, we end up with a distribution of variance values. It is easier to work with the precision λ:

$$\lambda = \frac{1}{\sigma^2} \tag{B.44}$$

The univariate normal distribution is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] = \frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\lambda}{2}(x-\mu)^2\right] \tag{B.45}$$


We have a dependency on λ of the form

$$\frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\lambda}{2}(x-\mu)^2\right]$$

The Gamma distribution is

$$Gam(x|\alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1}}{\Gamma(\alpha)}\exp(-\beta x), \quad x > 0,\ \alpha > 0,\ \beta > 0 \tag{B.46}$$

where

$$\Gamma(\alpha) = \int_0^{\infty} e^{-y}y^{\alpha-1}\,dy \tag{B.47}$$

Γ(α) = (α − 1)Γ(α − 1). If α is an integer, Γ(n) = (n − 1)!, and Γ(1) = 1. (For α = 1, the Gamma distribution becomes an exponential distribution.) For a Gamma variable, the moment generating function is

$$\psi(t) = E\left[e^{tx}\right] = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^{\infty} e^{tx}e^{-\beta x}x^{\alpha-1}\,dx = \left(\frac{\beta}{\beta - t}\right)^\alpha \tag{B.48}$$

$$E[x] = \psi'(0) = \frac{\alpha}{\beta} \tag{B.49}$$

$$Var[x] = \psi''(0) - (E[x])^2 = \frac{\alpha}{\beta^2} \tag{B.50}$$

The precision follows a Gamma distribution. We had

Fig. B.10: Gamma distribution.

$$N(x|\mu, \sigma^2) = N(\mu, \lambda^{-1}) \tag{B.51}$$

We want the average of several such distributions:

$$E_\lambda\left[N(\mu, \lambda^{-1})\right] \tag{B.52}$$

which is

$$\int_0^{\infty} N(x|\mu, \lambda^{-1})\,Gam(\lambda|\alpha, \beta)\,d\lambda \tag{B.53}$$

This averaged distribution can be shown to be (details in another handout on Moodle)

$$\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}} \tag{B.54}$$

This is the Student's t-distribution, St(x|μ, λ, ν), where ν = 2α signifies the degrees of freedom.

Fig. B.11: The t distribution. Fig. B.12: The t and N(0, 1) distributions.

Notice the symmetry about 0 in Figs. B.11 and B.12. Quantiles corresponding to the 95% confidence interval for the t-distribution, as a function of the degrees of freedom ν, are shown in Table B.1. For n ≥ 30, t ≈ N(0, 1).

Fig. B.13: Intervals on the t distribution.

Table B.1:
ν | t_{0.025,ν}
4 | 2.776
9 | 2.262

Now

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \text{ follows the } N(0, 1) \text{ curve, but } \frac{\bar{x} - \mu}{s/\sqrt{n}} \text{ follows the } t_{n-1} \text{ curve}$$

The interval for μ, when σ² was known, was

$$\bar{x} \pm \frac{\sigma}{\sqrt{n}}\,z_{\alpha/2}$$

The interval for μ when σ² is unknown is

$$\bar{x} \pm \frac{s}{\sqrt{n}}\,t_{\alpha/2,\,n-1} \tag{B.55}$$

Similarly, lower and upper one-sided intervals are

$$\left(\bar{x} - t_{\alpha,n-1}\frac{s}{\sqrt{n}},\ \infty\right) \tag{B.56}$$

$$\left(-\infty,\ \bar{x} + t_{\alpha,n-1}\frac{s}{\sqrt{n}}\right) \tag{B.57}$$


Example B.3 A fuse manufacturer states that with an overload, fuses blow in 12.40 minutes on average. As a test, take 20 sample fuses and subject them to an overload. A mean time of 10.63 minutes and a standard deviation of 2.48 minutes are obtained. Is the manufacturer right?

Soln:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{10.63 - 12.40}{2.48/\sqrt{20}} = -3.19$$

From a t table,

$$P(t_{19} > 2.861) = 0.005 \Rightarrow P(t_{19} < -2.861) = 0.005$$

Hence the manufacturer is wrong and the mean time is probably (with at least 99% confidence) below 12.40 minutes.
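Example B.3 via the t-distribution in SciPy (a sketch; only summary statistics are available, so the statistic is computed directly rather than via scipy.stats.ttest_1samp):

```python
from scipy.stats import t
import math

xbar, s, n, mu0 = 10.63, 2.48, 20, 12.40
tstat = (xbar - mu0) / (s / math.sqrt(n))
print(tstat)                                    # -3.19

p_one_sided = t.cdf(tstat, df=n - 1)            # P(t_19 < -3.19)
print(p_one_sided)                              # ~0.002: reject at the 1% level
```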

3. Normal: interval for σ²

Let (x1, ..., xn) ∼ N(μ, σ²). The point estimate of σ² is s²:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \quad \text{and} \quad E[s^2] = \sigma^2 \tag{B.58}$$

We have

$$z_i = \frac{(x_i - \mu)}{\sigma} \sim N(0, 1) \tag{B.59}$$

Let

$$X = z_1^2 + \cdots + z_n^2 \tag{B.60}$$

X follows a χ² distribution with n degrees of freedom:

$$X \sim \chi^2_n \tag{B.61}$$

The χ² with n degrees of freedom is related to the Gamma distribution. The χ² distribution has x always > 0 and is skewed to the right. It is not symmetric; the skew → 0 as n → ∞.

Fig. B.14: The χ² distribution (k = 2, 5, 10).

We have

$$\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\sigma^2} \sim \chi^2_n \tag{B.62}$$

When we substitute μ with x̄,

$$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1} \tag{B.63}$$

But

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \Rightarrow \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \Rightarrow s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1} \tag{B.64}$$

We compute intervals using quantiles, as before. There is no symmetry, so we need to evaluate both quantiles for a two-sided interval:

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \tag{B.65}$$

$$P\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}\right) = 1 - \alpha$$

B.3 Hypothesis Testing

Consider testing the mean of a normal population with known σ²:

H0: μ = μ0
H1: μ = μ1 (≠ μ0)

Do we need to pay attention to H1? Yes: H1 influences the interval estimate. To see that, we need to understand what β means. But before that, why should H0 be the equality statement? Because the moment we say something like H0: μ = μ0, we have a specific distribution to work with! The equality statement carries the burden of proof. The legal system prefers that we try to disprove H0: X is innocent. Typically, the accused is innocent (H0) until proven guilty (H1) beyond reasonable doubt (α).

Fig. B.16: Hypothesis acceptance.

What is α?

The error of the estimate is |x̄ − μ|. This maximum error is < $z_{\alpha/2}\sigma/\sqrt{n}$ with probability 1 − α (Fig. B.17: The error).

There are 4 possible outcomes to a test of H0 vs. H1:

1. Accept H0 as true, when in fact it is true.
2. Accept H0 as true, when in fact it is not true (i.e. H1 is true).
3. Reject H0 as true, when in fact H0 is true.
4. Reject H0 as true, when in fact H0 is not true (i.e. H1 is true).

Decision taken | H0 true              | H1 true
Accept H0      | correct              | Type II error (β)
Accept H1      | Type I error (α)     | correct

Type I error: H0 true, but H1 accepted ⇒ α. (Remember that α = significance level of the test applied. Where there's an α... there's a β.) Type II error: H1 true, but H0 accepted ⇒ β. Assuming H1: μ = μ1, with μ1 > μ0, 1 − β is the power of the test. β depends on μ1: increase |μ0 − μ1| and 1 − β increases ⇒ a more powerful test. (Fig. B.18: α and β.)

We accept H0 in the range

$$\left|\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right| \le z_{\alpha/2} \Leftrightarrow \mu_0 - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu_0 + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \tag{B.85}$$

But β is the area in this range under H1. We need to convert the thresholds to standardized coordinates first, this time under H1. Under H1,

$$\bar{x} \sim N\left(\mu_1, \frac{\sigma^2}{n}\right) \tag{B.86}$$

The location $\mu_0 - z_{\alpha/2}\sigma/\sqrt{n}$ becomes (under H1)

$$\frac{\mu_0 - z_{\alpha/2}\sigma/\sqrt{n} - \mu_1}{\sigma/\sqrt{n}} = \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2} \tag{B.87}$$

(Fig. B.19: The acceptance interval.) Similarly, the other threshold is

$$\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2} \tag{B.88}$$

Therefore

$$\beta = P\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2} \le z \le \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) = \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) - \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2}\right) \tag{B.89}$$

If μ1 > μ0, then the second term → 0, so

$$\beta \approx \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right), \quad \text{and by definition } \beta = P(z > z_\beta) = P(z < -z_\beta) = \Phi(-z_\beta) \tag{B.90}$$

$$\Rightarrow -z_\beta = \frac{(\mu_0 - \mu_1)\sqrt{n}}{\sigma} + z_{\alpha/2} \tag{B.91}$$

$$\Rightarrow n = \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{(\mu_1 - \mu_0)^2} \tag{B.92}$$

The sample size depends on the following (a numerical sketch follows the figures below):

1. |μ0 − μ1|: as |μ0 − μ1| increases, n decreases. (Fig. B.20: Effect of |μ0 − μ1|, shown for H0: μ = 50 against H1: μ = 52 and against H1: μ = 50.5.)
2. σ²: as σ² increases, n increases. (Fig. B.21: Effect of σ².)
3. α: as α decreases, z_{α/2} increases, hence n increases. (Fig. B.22: Effect of α.)
4. β: as the required power increases (1 − β increases), n increases.


Fig. B.23: Effect of β.
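A sketch computing β from (B.89) and the required n from (B.92) (SciPy assumed; μ0, μ1, σ and the desired power are hypothetical):

```python
from scipy.stats import norm
import math

mu0, mu1, sigma, alpha = 50.0, 52.0, 4.0, 0.05
z_a2 = norm.ppf(1 - alpha / 2)

def beta(n):                                    # Eq. (B.89), two-sided test
    d = (mu0 - mu1) / (sigma / math.sqrt(n))
    return norm.cdf(d + z_a2) - norm.cdf(d - z_a2)

for n in (10, 20, 40):
    print(n, beta(n))                           # beta falls as n grows

power = 0.90                                    # desired 1 - beta
z_b = norm.ppf(power)
n_req = (z_a2 + z_b) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2
print(math.ceil(n_req))                         # Eq. (B.92): ~43
```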

One-sided vs. two-sided tests

Consider the null hypothesis H0: μ = μ0. The rejection regions depend on the alternative:

(a) H1: μ ≠ μ0 — critical regions in both tails, beyond ±z_{α/2} (each of area α/2);
(b) H1: μ > μ0 — critical region in the right tail, beyond z_α;
(c) H1: μ < μ0 — critical region in the left tail, beyond −z_α.

Fig. B.24: One-sided vs. two-sided tests. (Which is more conservative: a two-sided test or a one-sided test?)

When H0: μ = μ0 and H1: μ > μ0, we should

$$\text{accept } H_0 \text{ if } \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \le z_\alpha \tag{B.93}$$

$$\text{reject } H_0 \text{ if } \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha \tag{B.94}$$

p-value

The p-value is the significance level at which we would be exactly indifferent between accepting and rejecting H0. Find the probability of exceeding the test statistic (in absolute value) on a standard Normal:
\[ \text{p-value} = P\!\left( |z| > \frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}} \right) \tag{B.95} \]

[Fig. B.25: The p-value]

p-value and one-sided tests

For H0 : $\mu = \mu_0$ and H1 : $\mu \ne \mu_0$, we would reject if the test statistic t satisfies $|t| > z_{\alpha/2}$:
\[ \text{p-value} = 2\,P(z \ge |t|) \]
For H0 : $\mu \le \mu_0$ and H1 : $\mu > \mu_0$, we would reject if $t > z_\alpha$:
\[ \text{p-value} = P(z \ge t) \]
For H0 : $\mu \ge \mu_0$ and H1 : $\mu < \mu_0$, we would reject if $t < -z_\alpha$:
\[ \text{p-value} = P(z \le t) \]

Why is it important to report the p-value? Why not just report the test result (yes/no) based on the chosen α?

    When you perform a hypothesis test, you should report

    1. the significance level of the test that you chose,

    2. AND the p-value associated with your test statistic.
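In code, the two-sided p-value of Eq. B.95 is a single quantile call. The inputs below are those of Example B.5 further on, so the output can be checked against that worked solution:

```python
from math import sqrt
from scipy.stats import norm

# Inputs as in Example B.5 below: sample mean, hypothesized mean, known sigma, n
xbar, mu0, sigma, n = 1.2, 1.0, 0.4, 12

z = (xbar - mu0) / (sigma / sqrt(n))   # standardized test statistic
p = 2 * norm.sf(abs(z))                # Eq. B.95, two-sided
print(z, p)                            # ~1.732, p ~ 0.083: report both alpha and p
```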


A problem solving tip

For $\mu_1 > \mu_0$, and for a two-sided test,
\[ \beta \approx \Phi\!\left( \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2} \right) \tag{B.96} \]
Observe that
\[ \beta = f(|\mu_0 - \mu_1|,\, \sigma^2,\, n,\, \alpha) \tag{B.97} \]
The sample size n is
\[ n \ge \frac{(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{(\mu_1 - \mu_0)^2} \tag{B.98} \]
and depends on
\[ n = f(|\mu_0 - \mu_1|,\, \sigma^2,\, \alpha,\, \beta) \tag{B.99} \]
Most problems require us to solve for one of the following in terms of the rest:
\[ |\mu_0 - \mu_1|, \quad \sigma^2, \quad \alpha, \quad \beta, \quad n \]

β, power and rejection of H0

As we increase $\mu_1$, β decreases. The curve described by $\beta(\mu_1)$ is called an operating characteristic (OC). For a fixed value of α, plot
\[ \beta(\mu_1) \;\text{ vs. }\; \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}} \tag{B.100} \]

[Fig. B.26: The OC curve for a two-sided test with α = 0.05]
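The OC curve can be traced numerically from Eq. B.89. A minimal sketch, with illustrative values (μ0 = 50, σ = 4, n = 25 are assumptions, not from the text):

```python
from math import sqrt
from scipy.stats import norm

def beta_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Type II error probability of Eq. B.89 for a two-sided z-test."""
    z = norm.ppf(1 - alpha / 2)
    shift = (mu0 - mu1) / (sigma / sqrt(n))
    return norm.cdf(shift + z) - norm.cdf(shift - z)

# beta falls (power rises) as mu1 moves away from mu0 = 50
for mu1 in (50.5, 51.0, 52.0, 53.0):
    print(mu1, round(beta_two_sided(50.0, mu1, 4.0, 25), 3))
```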

One sided tests and β

We had, when H0 : $\mu = \mu_0$ and H1 : $\mu > \mu_0$,
\[ \beta(\mu_1) = P(\text{accepting } H_0 \mid H_1 : \mu = \mu_1 \text{ is correct}) \tag{B.101} \]
We switch to the more general notation, replacing $\mu_1$ with $\mu$:
\[ \beta(\mu) = P(\text{accepting } H_0 \mid H_1 \text{ is correct}) \tag{B.102} \]

[Fig. B.27: Finding β]

\[
\begin{aligned}
\beta(\mu) &= P\!\left( \bar{x} \le \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} \right)
= P\!\left( \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha \right) \\
&= P\!\left( z \le \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha \right), \quad z \sim N(0,1) \\
&= \Phi\!\left( \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha \right)
\end{aligned} \tag{B.103}
\]

Compare this against β for a two-sided interval: $z_{\alpha/2}$ has been replaced with $z_\alpha$, and $\mu_1$ has been replaced with the more generic $\mu$. As its argument increases, Φ increases; therefore as $\mu$ increases, $\beta(\mu)$ decreases. The larger $\mu$ is, the less likely it would be to conclude that $\mu = \mu_0$, or for that matter $\mu \le \mu_0$. Note that
\[ \beta(\mu_0) = 1 - \alpha \]


H0 and H1

To test H0 : $\mu = \mu_0$ and H1 : $\mu > \mu_0$, we compared
\[ \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \;\text{ vs. }\; z_\alpha \]
If instead we wished to test H0 : $\mu \le \mu_0$ and H1 : $\mu > \mu_0$, what changes must we make to the test procedure? How did H0 : $\mu = \mu_0$ influence the testing procedure?

We put down the H0 curve centered around $\mu_0$, and then chose α and identified an interval based on it. We meant that the probability of rejection of H0 should be (the small valued) α. As we relax H0 to now be $\mu \le \mu_0$, we need to ensure that the probability of rejection of H0 is never greater than α. But the probability of accepting H0 for a given $\mu$ is $\beta(\mu)$. Therefore, for the probability of rejection to be never greater than α, we would need
\[ 1 - \beta(\mu) \le \alpha \quad \text{for all } \mu \le \mu_0 \]
which implies that we would need
\[ \beta(\mu) \ge 1 - \alpha \quad \text{for all } \mu \le \mu_0 \]
But, as seen before, $\beta(\mu)$ decreases as $\mu$ increases. Therefore $\beta(\mu) \ge \beta(\mu_0)$. But $\beta(\mu_0) = 1 - \alpha$,

[Fig. B.28: α and β]

and therefore we confirm that
\[ \beta(\mu) \ge 1 - \alpha \quad \text{for all } \mu \le \mu_0 \]
Therefore, a test for H0 : $\mu = \mu_0$ at level α also works for H0 : $\mu \le \mu_0$ at level α.

    B.4 Parametric hypothesis testing

    Overview of Parametric Tests discussed

1. Testing the mean of a Normal population, variance known (z-Test)
2. Testing the mean of a Normal population, variance unknown (t-Test)
3. Testing the variance of a Normal population ($\chi^2$-Test)
4. Testing the Binomial proportion
5. Testing the equality of means of two Normal populations, variances known
6. Testing the equality of means of two Normal populations, variances unknown but assumed equal
7. Testing the equality of means of two Normal populations, variances unknown and found to be unequal
8. Testing the equality of variances of two Normal populations (F-Test)
9. Testing the equality of means of two Normal populations, the paired t-Test.

    The approach for any Parametric Test

    Step 1. State the hypothesis.

    a. Identify the random variable and the relevant parameter to be tested.

    b. Clearly state H0 and H1.

    c. Identify the distribution that applies under H0.

Step 2. Identify the significance level (α) at this stage.
Step 3. Based on H1 and α, identify the critical threshold(s).


a. The relevant (standardized) sampling distribution needs to be identified, knowing the sample size/degrees of freedom.
b. One-sided or two-sided test?
Step 4. Compute the test statistic from the sampled data.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

1. The z-Test: One sample test for μ when the variance is known

Step 1. State the hypothesis.
H0 : $\mu = \mu_0$, H1 : $\mu \ne \mu_0$. Under H0, $\bar{x} \sim N(\mu_0, \sigma^2/n)$.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s).
At α = 0.05, on the Standard Normal, for a two-sided test, $z_{\alpha/2} = 1.96$.
Step 4. Compute the test statistic from the sampled data:
\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \;\text{ vs. }\; N(0,1) \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

2. The t-Test: One sample test for μ when the variance is unknown

Step 1. State the hypothesis.
H0 : $\mu = \mu_0$, H1 : $\mu \ne \mu_0$. Under H0, $\bar{x} \sim N(\mu_0, \sigma^2/n)$, but $\sigma^2$ is unknown: standardize with s and use the $t_{n-1}$ curve.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s): use $t_{\alpha/2,n-1}$ on the $t_{n-1}$ curve.
Step 4. Compute the test statistic from the sampled data:
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \;\text{ vs. }\; t_{n-1} \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

Example B.4 12 rats are subjected to forced exercise. Each weight change (in g) is the weight after exercise minus the weight before. Does exercise cause weight loss?
1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0


Soln:
Step 1. H0 : $\mu = 0$, H1 : $\mu \ne 0$.
Step 2. Use α = 0.05.
Step 3. $t_{0.025,11} = 2.201$.
Step 4. n = 12, $\bar{x} = -0.65$ g and $s^2 = 1.5682$ g².
\[ t = \frac{\bar{x} - 0}{s/\sqrt{n}} = -1.80 \]
Step 5. Since $|t| < t_{\alpha/2,n-1}$, H0 cannot be rejected.
Step 6. p-value = ?
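As a cross-check (and to answer Step 6), the same one-sample t-test can be run with scipy on the Example B.4 data:

```python
from scipy import stats

# Weight changes (g) for the 12 rats of Example B.4
d = [1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0]

res = stats.ttest_1samp(d, popmean=0.0)   # two-sided one-sample t-test
print(res.statistic, res.pvalue)          # t ~ -1.80 on 11 df, p just under 0.10
```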

Example B.5 The level of metabolite X measured in 12 patients 24 hours after they received a newly proposed antibiotic was 1.2 mg/dL. If the mean and standard deviation of X in the general population are 1.0 and 0.4 mg/dL, respectively, then, using a significance level of 0.05, test if the level of X in this group is different from that of the general population. What happens if α = 0.1?

Soln:
Step 1. H0 : $\mu = 1.0$, H1 : $\mu \ne 1.0$.
Step 2. Use α = 0.05.
Step 3. $z_{0.025} = 1.96$.
Step 4. n = 12, $\bar{x} = 1.2$, σ = 0.4.
\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = 1.732 \;\text{ vs. }\; 1.96 \]
Step 5. H0 cannot be rejected at α = 0.05.
Step 6. p-value = $2\,P(z > 1.732) = 2 \times 0.0416 = 0.0832$.
What happens at α = 0.1? Since the p-value 0.0832 < 0.1, H0 would then be rejected.

Example B.6 (Example B.5 contd.) Suppose the sample standard deviation of metabolite X is 0.6 mg/dL. Assume that the (population) standard deviation of X is not known, and perform the hypothesis test.

Soln:
Step 1. H0 : $\mu = 1.0$, H1 : $\mu \ne 1.0$.
Step 2. Use α = 0.05.
Step 3. $t_{0.025,11} = 2.201$.
Step 4. n = 12, $\bar{x} = 1.2$, s = 0.6.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 1.155 \;\text{ vs. }\; 2.201 \]
Step 5. Cannot reject H0.
Step 6. $P(t < 1.155)$ on a $t_{11}$ curve is 0.8637, so
p = 2 × (1 − 0.8637) = 0.2726.

Example B.7 The nationwide average score in an IQ test = 120. In a poor area, 100 student scores are analyzed: the mean is 115 and the standard deviation is 24. Is the area average lower than the national average?


Soln:
Step 1. H0 : $\mu = 120$ vs. H1 : $\mu < 120$.
Step 2. Use α = 0.05.
Step 3. $t_{0.05,99} = 1.66$; $z_{0.05} = 1.645$. Use either value. Since we have H1 : $\mu < 120$, our critical level is -1.66.
Step 4.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{115 - 120}{24/\sqrt{100}} = -2.08 \;\text{ vs. }\; -1.66 \]
Step 5. We reject H0 at a significance level of 0.05.
Step 6. The p-value is $P(t_{99} < -2.08) = 0.02$. The result is statistically significant. We could not have rejected H0 at significance level 0.01.

Example B.8 (Example B.7 contd.) Suppose 10,000 scores were measured with mean = 119 and s = 24. Redo the test.

Soln:
Step 1. H0 : $\mu = 120$ vs. H1 : $\mu < 120$.
Step 2. Use α = 0.05.
Step 3. $t_{0.05,9999} \approx z_{0.05} = 1.645$. Since we have H1 : $\mu < 120$, our critical level is -1.645.
Step 4.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{119 - 120}{24/\sqrt{10000}} = -4.17 \;\text{ vs. }\; -1.645 \]
Step 5. We reject H0 at a significance level of 0.05.
Step 6. The p-value is $P(t_{9999} < -4.17) \approx P(z < -4.17) < 0.001$. The result is highly statistically significant, but probably not scientifically significant: the detected difference is only 1 IQ point.

Example B.9 A pilot study of a new BP drug is performed for the purpose of planning a larger study. Five patients who have a mean BP of at least 95 mm Hg are recruited for the study and are kept on the drug for 1 month. After 1 month, the observed mean decline in BP in these 5 patients is 4.8 mm Hg, with a standard deviation of 9 mm Hg.
If $\mu_d$ = true mean difference in BP between baseline and 1 month, then how many patients would be needed to have a 90% chance of detecting a significant difference using a one-tailed test with a significance level of 5%?

Example B.10 (Example B.9 contd.) We go ahead and study the preceding hypothesis based on 20 people (we could not find 31). What is the probability that we will be able to reject H0 using a one-sided test at the 5% level, if the true mean and standard deviation of the BP difference are the same as in the pilot study?
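Both questions can be worked numerically with the one-sided analogues of the sample-size and power formulas above ($z_\alpha$ in place of $z_{\alpha/2}$). A sketch using the pilot-study numbers:

```python
from math import ceil, sqrt
from scipy.stats import norm

delta, sigma = 4.8, 9.0     # pilot mean decline and sd of the BP difference
alpha = 0.05                # one-tailed test

z_a = norm.ppf(1 - alpha)   # z_alpha for a one-sided test
z_b = norm.ppf(0.90)        # z_beta for 90% power

n = ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)
print(n)                    # Example B.9: 31 patients

power_20 = norm.cdf(delta * sqrt(20) / sigma - z_a)
print(power_20)             # Example B.10: ~0.77 with only 20 patients
```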


3. The $\chi^2$-Test: One sample test for the variance of a Normal population

Step 1. State the hypothesis.
H0 : $\sigma^2 = \sigma_0^2$, H1 : $\sigma^2 \ne \sigma_0^2$. Standardize $s^2$ and use the $\chi^2_{n-1}$ curve. We need both thresholds: $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $\chi^2_{n-1}$ curve.
Step 4. Compute the test statistic from the sampled data:
\[ \frac{(n-1)s^2}{\sigma_0^2} \;\text{ vs. }\; \chi^2_{n-1} \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.

Example B.11 We desire to estimate the concentration (μg/mL) of a specific dose of ampicillin in the urine after various periods of time. We recruit 25 volunteers who have received ampicillin and find that they have a mean concentration of 7.0 μg/mL, with a standard deviation of 2.0 μg/mL. Assume that the underlying distribution of concentrations is normally distributed.
1. Find a 95% confidence interval for the population mean concentration.
Soln: (6.17, 7.83)
2. Find a 99% confidence interval for the population variance of the concentrations.
Soln: (2.11, 9.71)
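Both intervals come from a few quantile lookups; a minimal sketch reproducing the Example B.11 answers with the standard $t$ and $\chi^2$ interval formulas:

```python
from math import sqrt
from scipy.stats import chi2, t

n, xbar, s = 25, 7.0, 2.0

# 95% CI for the mean: xbar +/- t_{0.025,24} * s / sqrt(n)
half = t.ppf(0.975, n - 1) * s / sqrt(n)
print(xbar - half, xbar + half)                  # ~(6.17, 7.83)

# 99% CI for the variance: (n-1)s^2 divided by the chi-square quantiles
print((n - 1) * s**2 / chi2.ppf(0.995, n - 1),
      (n - 1) * s**2 / chi2.ppf(0.005, n - 1))   # ~(2.11, 9.71)
```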

Example B.12 (Example B.11 contd.) How large a sample would be needed to ensure that the length of the confidence interval in part (a) (for the mean) is 0.5 μg/mL, if we assume that the standard deviation remains at 2.0 μg/mL?
Soln: n = 246

Example B.13 Iron deficiency causes anemia. 21 students from families below the poverty level were tested, and the mean daily iron intake was found to be 12.50 mg with standard deviation 4.75 mg. For the general population, the standard deviation is 5.56 mg. Are the two groups comparable?
Soln: The statistic $(n-1)s^2/\sigma_0^2 = 20 \times 4.75^2/5.56^2 = 14.6$ lies within (9.591, 34.170), and hence we cannot reject H0. Find the p-value yourself.


4. A Test for the Binomial Proportion

One sample test for p, assuming that the Normal approximation to the Binomial is valid.
Step 1. State the hypothesis.
H0 : $p = p_0$, H1 : $p \ne p_0$. Under H0, $x \sim N(np_0,\, np_0(1 - p_0))$. Standardize and use the N(0,1) curve.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the z curve.
Step 4. Compute the test statistic from the sampled data:
\[ z = \frac{x - np_0}{\sqrt{np_0(1 - p_0)}} \;\text{ vs. }\; z_{\alpha/2} \]
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
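A minimal sketch of this proportion test; the counts are hypothetical, chosen only for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical data: x successes in n trials, testing p0 = 0.5
x, n, p0 = 60, 100, 0.5

z = (x - n * p0) / sqrt(n * p0 * (1 - p0))   # Normal approximation to the Binomial
print(z, 2 * norm.sf(abs(z)))                # z = 2.0, p ~ 0.046
```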

5. A Test for the equality of means of two Normal populations, when the variances are known

Consider measurements $x_i$ and $y_i$ arising out of two different populations. We assume that we have n measurements for $x_i$ and m for $y_i$:
\[ x_i \sim N(\mu_x, \sigma_x^2), \qquad y_i \sim N(\mu_y, \sigma_y^2) \tag{B.104} \]
Then the point estimates of the means for the two populations are
\[ \bar{x} \sim N\!\left(\mu_x, \frac{\sigma_x^2}{n}\right), \qquad \bar{y} \sim N\!\left(\mu_y, \frac{\sigma_y^2}{m}\right) \tag{B.105} \]
This therefore implies that (remember that variances of independent variables add up)
\[ \bar{x} - \bar{y} \sim N\!\left(\mu_x - \mu_y,\; \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}\right) \tag{B.106} \]
Step 1. State the hypothesis.
H0 : $\mu_x - \mu_y = 0$, H1 : $\mu_x - \mu_y \ne 0$. Under H0, $\bar{x} - \bar{y}$ follows Eq. B.106. Standardize and use the N(0,1) curve.

Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s).
At α = 0.05, on the Standard Normal, for a two-sided test, $z_{\alpha/2} = 1.96$.
Step 4. Compute the test statistic from the sampled data:
\[ z = \frac{(\bar{x} - \bar{y}) - 0}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}} \;\text{ vs. }\; N(0,1) \]

If $\sigma_x^2 = \sigma_y^2 = \sigma^2$ (as would be expected when the sample sets come from the same population under H0), then
\[ \bar{x} - \bar{y} \sim N\!\left(\mu_x - \mu_y,\; \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)\right) \tag{B.107} \]


The difficulty that remains is identifying the degrees of freedom of the t distribution that needs to be looked up. The recommended approach is to find the degrees of freedom d′ as follows:
\[ d' = \frac{\left( \dfrac{S_x^2}{n} + \dfrac{S_y^2}{m} \right)^2}{\dfrac{(S_x^2/n)^2}{n-1} + \dfrac{(S_y^2/m)^2}{m-1}} \tag{B.112} \]
\[ d'' = d' \text{ rounded down} \tag{B.113} \]

Step 1. State the hypothesis.
H0 : $\mu_x - \mu_y = 0$, H1 : $\mu_x - \mu_y \ne 0$. Under H0, $\bar{x} - \bar{y}$ follows Eq. B.106. Standardize and use the $t_{d''}$ curve.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $t_{d''}$ curve.
Step 4. Compute the test statistic from the sampled data using Eq. B.111.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
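A sketch of the unequal-variance procedure on invented data; scipy's ttest_ind with equal_var=False applies the same Eq. B.112 correction (without the rounding down):

```python
import numpy as np
from scipy import stats

# Invented samples, for illustration only
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])
y = np.array([4.2, 4.5, 3.9, 4.8, 4.1])

# Welch degrees of freedom d', Eq. B.112 (round down per Eq. B.113)
vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
d = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
print(int(d))                                   # degrees of freedom to look up

print(stats.ttest_ind(x, y, equal_var=False))   # Welch's t-test
```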

8. A Test for the equality of variances of two Normal populations (F-test)

When the population variances for the two Normal populations are unknown, we have (in the previous two tests) developed tests for equality of the means, assuming that the variances are either equal (Case 6) or unequal (Case 7). What we should have been doing, before testing for the equality of the means, was to first ask whether the variances could be assumed equal, especially considering that the methodology in Case 6 is simpler than that in Case 7. From the discussion on the $\chi^2$ distribution, the ratio of two $\chi^2$ distributions (with, in general, different degrees of freedom) is a new distribution, the F distribution, dependent on the two sets of degrees of freedom:
\[ \frac{(n-1)S_x^2}{\sigma_x^2} \sim \chi^2_{n-1}, \qquad \frac{(m-1)S_y^2}{\sigma_y^2} \sim \chi^2_{m-1} \tag{B.114} \]
\[ \frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim \frac{\chi^2_{n-1}/(n-1)}{\chi^2_{m-1}/(m-1)} = F_{n-1,m-1} \tag{B.115} \]
Under H0 : $\sigma_x^2 = \sigma_y^2$, we get
\[ \frac{S_x^2}{S_y^2} \sim F_{n-1,m-1} \tag{B.116} \]

Then H0 cannot be rejected if (note that the outcome of the F-test should not depend on which data set is in the numerator of Eq. B.116)
\[ F_{1-\alpha/2,\,n-1,\,m-1} \le \frac{S_x^2}{S_y^2} \le F_{\alpha/2,\,n-1,\,m-1} \tag{B.117} \]
The choice of which set is labeled x and y is not important, given the following equivalence between two F distributions:
\[ P\!\left( \frac{\chi^2_{n-1}/(n-1)}{\chi^2_{m-1}/(m-1)} > F_{\alpha,n-1,m-1} \right)
= P\!\left( \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)} < \frac{1}{F_{\alpha,n-1,m-1}} \right) = \alpha
\;\Longrightarrow\;
P\!\left( \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)} > \frac{1}{F_{\alpha,n-1,m-1}} \right) = 1 - \alpha
= P\!\left( \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)} > F_{1-\alpha,m-1,n-1} \right) \tag{B.119} \]


which shows that
\[ \frac{1}{F_{\alpha,n-1,m-1}} = F_{1-\alpha,m-1,n-1} \tag{B.120} \]

The F-test is summarized as follows:
Step 1. State the hypothesis.
H0 : $\sigma_x^2 = \sigma_y^2$, H1 : $\sigma_x^2 \ne \sigma_y^2$. Under H0, we expect Eq. B.116.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $F_{n-1,m-1}$ curve.
Step 4. Compute the test statistic, $S_x^2/S_y^2$, from the sampled data.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
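A minimal sketch of the F-test on invented data (scipy has no one-call variance-ratio test, so the statistic and thresholds are computed directly):

```python
import numpy as np
from scipy.stats import f

# Invented samples, for illustration only
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])
y = np.array([4.2, 4.5, 3.9, 4.8, 4.1])
n, m = len(x), len(y)

F = x.var(ddof=1) / y.var(ddof=1)   # Eq. B.116
lo = f.ppf(0.025, n - 1, m - 1)     # F_{1-alpha/2,n-1,m-1}
hi = f.ppf(0.975, n - 1, m - 1)     # F_{alpha/2,n-1,m-1}
print(F, (lo, hi))                  # cannot reject H0 if lo <= F <= hi
```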

Sample sizes for two sample tests of the means

Consider H0 : $\mu_x = \mu_y$ and H1 : $\mu_x \ne \mu_y$. Under H0, the lower critical threshold is (in standardized coordinates) $-z_{\alpha/2}$. In the original coordinate system, relative to the H0 curve, this threshold is
\[ 0 - z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}} = -z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}} \]
But this coordinate, relative to the H1 curve, is known to be (after standardization) $z_\beta$:
\[ z_\beta = \frac{-z_{\alpha/2}\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}} \tag{B.121} \]
This can be rearranged to get, using k = m/n,
\[ \left( \frac{z_{\alpha/2} + z_\beta}{\mu_x - \mu_y} \right)^2 = \frac{n}{\sigma_x^2 + \sigma_y^2/k} \tag{B.122} \]
and hence, given α and β, the sample sizes required to differentiate between the two populations are
\[ n = \left( \frac{z_{\alpha/2} + z_\beta}{\mu_x - \mu_y} \right)^2 \left( \sigma_x^2 + \frac{\sigma_y^2}{k} \right) \tag{B.123} \]
\[ m = nk \tag{B.124} \]
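A sketch of Eqs. B.123-B.124; the numbers are assumptions for illustration, with equal allocation k = 1:

```python
from math import ceil
from scipy.stats import norm

def two_sample_sizes(delta, var_x, var_y, alpha=0.05, power=0.90, k=1.0):
    """Eqs. B.123-B.124: n for group x and m = nk for group y; delta = mu_x - mu_y."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)          # z_beta, with beta = 1 - power
    n = ((z_a + z_b) / delta) ** 2 * (var_x + var_y / k)
    return ceil(n), ceil(n * k)

# Assumed: detect a difference of 2.0 when both variances are 16
print(two_sample_sizes(2.0, 16.0, 16.0))   # (85, 85)
```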

9. A Test for the equality of means of two Normal populations, when the samples are paired (the paired t-test)

Consider n pairs of before and after measurements, which reflect, for example, the efficacy of some treatment. Then the n differences $d_i$ can be used to compute the mean sample difference on treatment ($\bar{d}$), and the sample standard deviation $sd(d_i)$. We have one vector of difference values, with the average difference expected to be zero according to H0. Since the population variance is unknown, we have a t test with n - 1 degrees of freedom:
\[ \frac{\bar{d} - E[\bar{d}]}{sd(\bar{d})} = \frac{\bar{d} - 0}{sd(d_i)/\sqrt{n}} \sim t_{n-1} \tag{B.125} \]

Then, the paired t-test may be summarized as follows.
Step 1. State the hypothesis.
H0 : $E[\bar{d}] = \mu = 0$, H1 : $\mu \ne 0$.


Under H0, we expect Eq. B.125 to hold.
Step 2. Identify the significance level (α): we set α = 0.05.
Step 3. Based on H1 and α, identify the critical threshold(s) on the $t_{n-1}$ curve.
Step 4. Compute the test statistic, $\sqrt{n}\,\bar{d}/sd(d_i)$, from the sampled data.
Step 5. Identify the test result given the critical level (based on α).
Step 6. Report the test result, and the p-value.
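A sketch on invented before/after data; per Eq. B.125, the paired t-test is exactly a one-sample t-test on the differences, which the last line confirms:

```python
import numpy as np
from scipy import stats

# Invented before/after measurements on the same six subjects
before = np.array([96.0, 98.5, 101.2, 95.4, 99.0, 97.3])
after = np.array([92.1, 97.0, 99.5, 94.8, 95.2, 96.0])

print(stats.ttest_rel(after, before))           # paired t-test
print(stats.ttest_1samp(after - before, 0.0))   # identical, per Eq. B.125
```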

B.5 Nonparametric hypothesis testing

When the data available to us does not in any fashion obviously follow some distribution, or when the data is ordinal in nature rather than cardinal, the parametric hypothesis tests previously described cannot be applied. We then resort to nonparametric hypothesis tests. Two essential features of these tests are that we pay attention to the median of a data set, rather than the arithmetic mean, and that we endeavor to identify a statistic, like a proportion or a rank, which can be expected to follow a parametric distribution. In this manner, we drag things back towards the more convenient parametric hypothesis tests. We review below three nonparametric tests, which will be seen to be counterparts of some of the two sample parametric tests already discussed:
1. The Wilcoxon sign test.
2. The Wilcoxon signed rank test.
3. The Wilcoxon rank sum test.

The Wilcoxon sign test

(The paired t-test is applied under the assumption that all measurements follow some Normal distribution. If the underlying distribution is not Normal, or if it is unknown, then this test is a safer test.)

The Wilcoxon sign test is the nonparametric counterpart to the paired t test, and is usually invoked to evaluate whether two data sets are similar. Let $d_i$ indicate the relative like or dislike for two categories (icecreams). Then we have 3 categories of data: $d_i < 0$ where flavor A is preferred, $d_i > 0$ with B being preferred, and $d_i = 0$ with equal preference for both flavors. We wish to determine whether there is a marked preference for one flavor versus the other. H0 would suggest an equal number of people preferring the two icecreams. The actual numbers in the survey voting for the two icecreams should turn out to be close. Note that those who cannot make up their minds, with $d_i = 0$, are irrelevant to the test of H0. Therefore, if n scores have $d_i \ne 0$, of which c have $d_i > 0$, the test itself will be to see if c is close enough to the median of the data set, n/2. If $\Delta = E[c]$ is the population median of the underlying distribution of $d_i$, then
\[ H_0 : E[c] = \Delta = \frac{n}{2}, \qquad H_1 : \Delta \ne \frac{n}{2} \]
Under H0, the binomial proportion of finding nonzero $d_i > 0$ is 1/2. Further, the Binomial-Normal approximation requires $\mathrm{Var}[c] = np(1-p) \ge 10$; under H0, since p = 1/2, we get $n \ge 40$. Then we reject H0 if
\[ \left| \frac{c - E[c]}{sd(c)} \right| = \left| \frac{c - n/2}{\sqrt{n/4}} \right| > z_{\alpha/2} \tag{B.126} \]


A continuity correction is required because we are shifting over from a discrete to a continuous distribution. H0 is rejected:
\[ \text{if } c > \frac{n}{2}: \qquad \frac{c - \frac{n}{2} - \frac{1}{2}}{\sqrt{n/4}} > z_{\alpha/2} \]
\[ \text{if } c < \frac{n}{2}: \qquad \frac{c - \frac{n}{2} + \frac{1}{2}}{\sqrt{n/4}} < -z_{\alpha/2} \tag{B.127} \]

Example B.14 2 flavours of icecream undergo a taste test. 45 people are asked to choose, and 22 prefer A, 18 prefer B, and 5 cannot make up their minds. Is one flavour preferred over the other?
Soln:
n = 40, c = 18, and the statistic is
\[ \frac{(18 - 20) + (1/2)}{\sqrt{40/4}} = \frac{-1.5}{\sqrt{10}} = -0.47 \]
The p-value is $2(1 - \Phi(0.47)) = 2(1 - 0.6808) \approx 0.64$. Hence H0 cannot be rejected according to this test.
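A sketch reproducing Example B.14, plus the exact Binomial version that avoids the Normal approximation altogether (binomtest is available in scipy >= 1.7):

```python
from math import sqrt
from scipy.stats import binomtest, norm

n, c = 40, 18                          # nonzero preferences; c prefer flavour B

z = (c - n / 2 + 0.5) / sqrt(n / 4)    # continuity-corrected, c < n/2 case
print(z, 2 * norm.sf(abs(z)))          # ~ -0.47, p ~ 0.64

print(binomtest(c, n, p=0.5).pvalue)   # exact sign test, same conclusion
```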

The Wilcoxon signed rank test

Rather than simply resorting to the proportions employed in the sign test, we bring more detail into the hypothesis test by ordering the measurements and assigning ranks to them. We then expect the ranks to be equally distributed between corresponding quantile groups on either side of the median. This therefore ends up testing whether the distribution function has similar shapes on either side of the median, as would be expected under H0, where the two groups are supposed to be identical. The procedure is as follows:
1. Arrange the $d_i$s in order of absolute value and get frequency counts.
2. As in the sign test, ignore the $d_i = 0$ observations.
3. Rank the remaining observations from 1 to n.
4. Let $R_1$ be the sum of ranks for group 1.
5. H0 assumes that the ranks are equally shared between the two groups, and H1 assumes otherwise. Since n ranks are given out, the sum of all ranks is n(n + 1)/2, and hence the rank sum for group 1 should be n(n + 1)/4:
\[ E[R_1] = \frac{n(n+1)}{4}, \qquad \mathrm{Var}[R_1] = \frac{n(n+1)(2n+1)}{24} \tag{B.128} \]
6. For n ≥ 16, we assume that the rank sum follows a Normal distribution:
\[ R_1 \sim N\!\left( \frac{n(n+1)}{4},\; \frac{n(n+1)(2n+1)}{24} \right) \tag{B.129} \]
7. The test statistic T, bringing in the continuity correction again, is
\[ T = \frac{\left| R_1 - \frac{n(n+1)}{4} \right| - \frac{1}{2}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \tag{B.130} \]


8. A further correction factor is required for the variance if there are subgroups with tied ranks; the variance calculated above will be an overestimate. If g indicates the number of tied groups, and $t_i$ the number of tied values in group i, then the corrected variance is
\[ \mathrm{Var}[R_1] = \frac{n(n+1)(2n+1)}{24} - \sum_{i=1}^{g} \frac{t_i^3 - t_i}{48} \tag{B.131} \]
Equation B.130 needs to be correspondingly updated.

Example B.15 (Example B.14 contd.) The 45 volunteers rate each icecream on a scale from 1 to 10 (10 = really, really like it; 1 = don't care). Let $d_i$ = score(B) - score(A), so that $d_i > 0$ means B is preferred. The frequency counts are:

|di|   fi (di < 0)   fi (di > 0)
 10         0             0
  9         0             0
  8         1             0
  7         3             0
  6         2             0
  5         2             0
  4         1             0
  3         5             2
  2         4             6
  1         4            10

Soln:

|di|   fi (di < 0)   fi (di > 0)   Total # of people   Range of ranks   Average rank
 10         0             0                0
  9         0             0                0
  8         1             0                1                  40               40
  7         3             0                3                37-39               38
  6         2             0                2                35-36             35.5
  5         2             0                2                33-34             33.5
  4         1             0                1                  32               32
  3         5             2                7                25-31             28.0
  2         4             6               10                15-24             19.5
  1         4            10               14                 1-14              7.5

We conveniently choose $R_1$ to be the group corresponding to $d_i > 0$, since it has fewer nonzero frequencies (the arithmetic becomes simpler).
\[ R_1 = 10(7.5) + 6(19.5) + 2(28.0) = 248 \]
\[ E[R_1] = \frac{40 \times 41}{4} = 410 \]
\[ \mathrm{Var}[R_1] = \frac{40 \times 41 \times 81}{24} - \frac{(14^3 - 14) + (10^3 - 10) + (7^3 - 7) + (3^3 - 3) + 2(2^3 - 2)}{48} = 5449.75 \]
\[ sd(R_1) = \sqrt{5449.75} = 73.82 \]
\[ T = \frac{|248 - 410| - (1/2)}{73.82} = 2.19 \]

The p-value is $2(1 - \Phi(2.19)) = 2(1 - 0.9857) \approx 0.03$, and hence H0 is rejected (likely because flavor A is preferred). Notice that by including the quantile level information, the signed rank test gives a different conclusion to that of the sign test. Since this test has involved more detailed analysis, the conclusion here is preferable to that in Example B.14. Observe, from the method employed, that this test is the nonparametric equivalent of the paired t test.
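The rank bookkeeping is easy to get wrong by hand; a short sketch reproducing the Example B.15 numbers directly from the frequency table:

```python
from math import sqrt
from scipy.stats import norm

# freq[|d|] = (count with d < 0, count with d > 0), from the Example B.15 table
freq = {1: (4, 10), 2: (4, 6), 3: (5, 2), 4: (1, 0), 5: (2, 0),
        6: (2, 0), 7: (3, 0), 8: (1, 0)}

n = sum(a + b for a, b in freq.values())
rank_used, R1, tie_sum = 0, 0.0, 0
for absd in sorted(freq):
    a, b = freq[absd]
    t = a + b                           # size of this tied group
    avg_rank = rank_used + (t + 1) / 2  # average rank within the group
    R1 += b * avg_rank                  # rank sum of the d > 0 group
    tie_sum += t**3 - t
    rank_used += t

var = n * (n + 1) * (2 * n + 1) / 24 - tie_sum / 48   # Eq. B.131
T = (abs(R1 - n * (n + 1) / 4) - 0.5) / sqrt(var)     # Eq. B.130
print(R1, var, T, 2 * norm.sf(T))   # 248, 5449.75, ~2.19, p ~ 0.029
```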


The Wilcoxon rank sum test (aka the Mann-Whitney test, aka the U test)

The rank sum test is the nonparametric equivalent of the t-test for 2 independent samples. The procedure involves assigning ranks after combining the data from the two groups. As with the signed rank test, the rank sum for a group is compared with the expected rank sum. Let $R_1$ represent the rank sum in the first group, and let there be $n_1$ and $n_2$ samples in the two groups. Then, the sum of all the ranks is $(n_1 + n_2)(n_1 + n_2 + 1)/2$, and hence the average rank for the combined sample is $(n_1 + n_2 + 1)/2$. Under H0,
\[ E[R_1] = n_1\,\frac{n_1 + n_2 + 1}{2}, \qquad \mathrm{Var}[R_1] = n_1 n_2\,\frac{n_1 + n_2 + 1}{12} \tag{B.132} \]
If $n_1$ and $n_2$ are ≥ 10, the rank sum may be assumed to have an underlying continuous (Normal) distribution:
\[ R_1 \sim N\!\left( \frac{n_1(n_1 + n_2 + 1)}{2},\; \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} \right) \tag{B.133} \]

and hence the appropriate test statistic is
\[ T = \frac{\left| R_1 - \frac{n_1(n_1 + n_2 + 1)}{2} \right| - \frac{1}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \tag{B.134} \]
Once again a correction factor needs to be applied for groups of tied values. If g indicates the number of tied groups, and $t_i$ the number of tied values in group i, then the corrected variance is
\[ \mathrm{Var}[R_1] = \frac{n_1 n_2 (n_1 + n_2 + 1 - S)}{12}, \qquad S = \sum_{i=1}^{g} \frac{t_i^3 - t_i}{(n_1 + n_2)(n_1 + n_2 - 1)} \]
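For reference, scipy implements this test as mannwhitneyu; a minimal sketch with invented ordinal scores:

```python
from scipy.stats import mannwhitneyu

# Invented ordinal scores for two independent groups, for illustration only
g1 = [3, 5, 4, 6, 2, 5, 7, 4, 5, 6]
g2 = [2, 3, 1, 4, 3, 2, 5, 3, 2, 4]

# method="asymptotic" uses the Normal approximation with continuity and
# tie corrections, matching the large-sample procedure described above
res = mannwhitneyu(g1, g2, alternative="two-sided", method="asymptotic")
print(res.statistic, res.pvalue)
```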