Lecture 3: Statistics Review I Date: 9/3/02 Distributions Likelihood Hypothesis tests




Page 1:

Lecture 3: Statistics Review I

Date: 9/3/02
Distributions
Likelihood
Hypothesis tests

Page 2:

Sources of Variation

Definition: Sampling variation results because we only sample a fraction of the full population (e.g. the mapping population).

Definition: Experimental error arises from the laboratory procedures used to make measurements; it is often substantial, and sometimes systematic.

Page 3:

Parameters vs. Estimates

Definition: The population is the complete collection of all individuals or things you wish to make inferences about. Statistics calculated on populations are parameters.

Definition: The sample is a subset of the population on which you make measurements. Statistics calculated on samples are estimates.

Page 4:

Types of Data

Definition: Usually the data are discrete, meaning they can take on one of countably many different values.

Definition: Many complex and economically valuable traits are continuous. Such traits are quantitative, and the random variables associated with them are continuous; loci affecting such traits are quantitative trait loci (QTL).

Page 5:

Random

We are concerned with the outcome of random experiments.

Examples of random events:
production of gametes
union of gametes (fertilization)
formation of chiasmata and recombination

Page 6:

Set Theory I

Set theory underlies probability.

Definition: A set is a collection of objects.

Definition: An element is an object in a set. Notation: s ∈ S, "s is an element of S".

Definition: If A and B are sets, then A is a subset of B if and only if s ∈ A implies s ∈ B. Notation: A ⊆ B, "A is a subset of B".

Page 7:

Set Theory II

Definition: Two sets A and B are equal if and only if A ⊆ B and B ⊆ A. We write A = B.

Definition: The universal set is the superset of all other sets, i.e. all other sets are included within it. Often represented as Ω.

Definition: The empty set contains no elements and is denoted ∅.

Page 8:

Sample Space & Event

Definition: The sample space for a random experiment is the set that includes all possible outcomes of the experiment.

Definition: An event is a set of possible outcomes of the experiment. An event E is said to happen if any one of the outcomes in E occurs.

Page 9:

Example: Mendel I

Mendel took inbred lines of smooth (AA) and wrinkled (BB) peas and crossed them to make the F1 generation, then crossed the F1 again to make the F2 generation. The smooth allele A is dominant to B.

The random experiment is the random production of gametes and fertilization to produce peas.

The sample space of genotypes for the F2 is {AA, AB, BB}.

Page 10:

Random Variable

Definition: A function from set S to set T is a rule assigning to each s ∈ S an element t ∈ T.

Definition: Given a random experiment on sample space Ω, a function from Ω to T is a random variable. We often write X, Y, or Z. If we were very careful, we'd write X(s).

Simply, X is a measurement of interest on the outcome of a random experiment.

Page 11:

Example: Mendel II

Let X be the number of A alleles in a randomly chosen genotype. X is a random variable.

The sample space of X is {0, 1, 2}.

Page 12:

Discrete Probability Distribution

Suppose X is a random variable with possible outcomes {x1, x2, …, xm}. Define the discrete probability distribution for random variable X as

$$p_X(x_i) = P(X = x_i),$$

with

$$p_X(x_i) \geq 0 \quad \text{and} \quad \sum_{i=1}^{m} p_X(x_i) = 1.$$

Page 13:

Example: Mendel III

$$p_X(x) = \begin{cases} 0.25 & x = 0 \\ 0.50 & x = 1 \\ 0.25 & x = 2 \\ 0 & \text{otherwise} \end{cases}$$
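As a quick sanity check, this F2 genotype distribution can be verified to be a valid pmf in a few lines of Python (a sketch; the dict encoding is my own):

```python
# X = number of A alleles in an F2 genotype; pmf from the slide.
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

# A valid discrete probability distribution is nonnegative and sums to 1.
assert all(p >= 0 for p in pmf.values())
total = sum(pmf.values())
print(total)  # 1.0
```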

Page 14:

Cumulative Distribution

The discrete cumulative distribution function is defined as

$$F_X(x) = P(X \leq x) = \sum_{x_i \leq x} P(X = x_i).$$

The continuous cumulative distribution function is defined as

$$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\, du.$$

Page 15:

Continuous Probability Distribution

If $F'(x) = \frac{dF(x)}{dx} = f(x)$ exists, then f(x) is the continuous probability distribution. As in the discrete case,

$$\int_{-\infty}^{\infty} f(u)\, du = 1.$$

Page 16:

Expectation and Variance

$$\mathrm{E}(X) = \begin{cases} \sum_{x_i} x_i\, p_X(x_i) & \text{for a discrete random variable} \\ \int_{-\infty}^{\infty} u\, f(u)\, du & \text{for a continuous random variable} \end{cases}$$

$$\mathrm{Var}(X) = \mathrm{E}\left[(X - \mu)^2\right] = \begin{cases} \sum_{x_i} (x_i - \mu)^2\, p_X(x_i) & \text{for a discrete random variable} \\ \int_{-\infty}^{\infty} (u - \mu)^2 f(u)\, du & \text{for a continuous random variable} \end{cases}$$

where μ = E(X).
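Applying these definitions to the Mendel F2 pmf gives E(X) = 1 and Var(X) = 0.5; a short Python sketch (my own encoding of the pmf):

```python
# pmf for X = number of A alleles in the F2 (from the Mendel example).
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

# E(X) = sum over x of x * p(x)
mean = sum(x * p for x, p in pmf.items())
# Var(X) = E[(X - mu)^2] = sum over x of (x - mu)^2 * p(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
print(mean, var)  # 1.0 0.5
```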

Page 17:

Moments and MGF

Definition: The rth moment of X is E(Xʳ).

Definition: The moment generating function is defined as E(e^{tX}):

$$\mathrm{mgf}(X) = \mathrm{E}\left(e^{tX}\right) = \begin{cases} \sum_{x_i} e^{t x_i}\, p_X(x_i) & \text{for a discrete random variable} \\ \int_{-\infty}^{\infty} e^{tu} f(u)\, du & \text{for a continuous random variable} \end{cases}$$

Page 18:

Example: Mendel IV

Define the random variable Z as follows:

$$Z = \begin{cases} 0 & \text{if seed is smooth} \\ 1 & \text{if seed is wrinkled} \end{cases}$$

If we hypothesize that smooth dominates wrinkled in a single-locus model, then the corresponding probability model is:

Page 19:

Example: Mendel V

$$P(Z = 0) = \tfrac{3}{4}, \qquad P(Z = 1) = \tfrac{1}{4}$$

$$\mathrm{E}(Z) = 0 \cdot \tfrac{3}{4} + 1 \cdot \tfrac{1}{4} = \tfrac{1}{4}$$

$$\mathrm{Var}(Z) = \left(0 - \tfrac{1}{4}\right)^2 \cdot \tfrac{3}{4} + \left(1 - \tfrac{1}{4}\right)^2 \cdot \tfrac{1}{4} = \tfrac{3}{16}$$
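The expectation and variance of Z can be confirmed numerically; a sketch using exact fractions:

```python
from fractions import Fraction

# pmf for Z: 0 = smooth (prob 3/4), 1 = wrinkled (prob 1/4)
pmf_Z = {0: Fraction(3, 4), 1: Fraction(1, 4)}

mean = sum(z * p for z, p in pmf_Z.items())
var = sum((z - mean) ** 2 * p for z, p in pmf_Z.items())
print(mean, var)  # 1/4 3/16
```

Note that Var(Z) = 3/16 also matches the Bernoulli formula p(1 − p) = (1/4)(3/4).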

Page 20:

Joint and Marginal Cumulative Distributions

Definition: Let X and Y be two random variables. Then the joint cumulative distribution is

$$F(x, y) = P(X \leq x,\, Y \leq y).$$

Definition: The marginal cumulative distribution is

$$F_X(x) = P(X \leq x,\, Y < \infty) = \begin{cases} \sum_{x_i \leq x} \sum_{y_j} p(x_i, y_j) & \text{for discrete random variables} \\ \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, v)\, dv\, du & \text{for continuous random variables} \end{cases}$$

Page 21:

Joint Distribution

Definition: The joint distribution is

$$p(x, y) = P(X = x,\, Y = y).$$

As before, the sum or integral over the sample space sums to 1.

Page 22:

Conditional Distribution

Definition: The conditional distribution of X given that Y = y is

$$p(x \mid y) = P(X = x \mid Y = y) = \frac{P(X = x,\, Y = y)}{P(Y = y)} = \frac{p(x, y)}{p_Y(y)}.$$

Lemma: If X and Y are independent, then p(x|y) = p(x), p(y|x) = p(y), and p(x, y) = p(x)p(y).

Page 23:

Example: Mendel VI

P(homozygous | smooth seed) =

$$P(X = 2 \mid Z = 0) = \frac{P(X = 2,\, Z = 0)}{P(Z = 0)} = \frac{1/4}{3/4} = \frac{1}{3}$$
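This conditional probability can be checked from the joint distribution of (X, Z); the tabulation below is my own encoding of the F2 probabilities:

```python
from fractions import Fraction

# Joint pmf of (X, Z) in the F2: AA -> (2, 0), AB -> (1, 0), BB -> (0, 1)
joint = {(2, 0): Fraction(1, 4), (1, 0): Fraction(1, 2), (0, 1): Fraction(1, 4)}

# Marginal: P(Z = 0) = P(smooth) = 3/4
p_smooth = sum(p for (x, z), p in joint.items() if z == 0)
# Conditional: P(X = 2 | Z = 0) = P(X = 2, Z = 0) / P(Z = 0)
p_hom_given_smooth = joint[(2, 0)] / p_smooth
print(p_hom_given_smooth)  # 1/3
```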

Page 24:

Binomial Distribution

Suppose there is a random experiment with two possible outcomes, which we call "success" and "failure". Suppose there is a constant probability p of success for each experiment, and that multiple experiments of this type are independent. Let X be the random variable that counts the total number of successes in n experiments. Then X ~ Bin(n, p).

Page 25:

Properties of Binomial Distribution

$$f(x; n, p) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n$$

$$\mathrm{mgf}(X) = \mathrm{E}\left(e^{tX}\right) = \left(p e^t + 1 - p\right)^n$$

$$\mathrm{E}(X) = np$$

$$\mathrm{Var}(X) = np(1 - p)$$
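The mean and variance formulas can be verified directly from the pmf; a sketch with illustrative values n = 10, p = 0.25:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = C(n, x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.25
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean) ** 2 * binom_pmf(x, n, p) for x in range(n + 1))
print(round(mean, 6), round(var, 6))  # 2.5 1.875, i.e. np and np(1 - p)
```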

Page 26:

Examples: Binomial Distribution

Recombinant fraction between two loci: count the number of recombinant gametes among the n sampled.

Phenotype in Mendel's F2 cross: count the number of smooth peas in the F2.

Page 27:

Multinomial Distribution

Suppose you consider genotype in Mendel’s F2 cross, or a 3-point cross.

Definition: Suppose there are m possible outcomes and the random variables X1, X2, …, Xm count the number of times each outcome is observed. Then

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = \frac{n!}{x_1!\, x_2! \cdots x_m!}\, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m},$$

where $\sum_i x_i = n$, and we write $(X_1, X_2, \ldots, X_m) \sim \mathrm{M}(n, p_1, p_2, \ldots, p_m)$.
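A sketch of the multinomial pmf, applied to hypothetical genotype counts (2 AA, 4 AB, 2 BB among 8 F2 peas, with probabilities 1/4, 1/2, 1/4):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    # n! / (x1! ... xm!) * p1^x1 * ... * pm^xm
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coef * prob

print(multinomial_pmf([2, 4, 2], [0.25, 0.5, 0.25]))  # 0.1025390625
```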

Page 28:

Poisson Distribution

Consider the Binomial distribution when p is small and n is large, but np = λ is constant. Then,

$$\lim_{n \to \infty} \binom{n}{x} p^x (1 - p)^{n - x} = \frac{\lambda^x e^{-\lambda}}{x!}.$$

The distribution obtained is the Poisson distribution.
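This limit can be seen numerically; a sketch with λ = 2 and x = 3, increasing n while keeping np = λ:

```python
from math import comb, exp, factorial

lam, x = 2.0, 3
poisson = lam**x * exp(-lam) / factorial(x)  # the limiting Poisson probability

for n in (10, 100, 10_000):
    p = lam / n
    binom = comb(n, x) * p**x * (1 - p) ** (n - x)  # Bin(n, p) probability of x
    print(n, round(binom, 6))
print("Poisson:", round(poisson, 6))  # 0.180447
```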

Page 29:

Properties of Poisson Distribution

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$

$$\mathrm{mgf}(X) = \mathrm{E}\left(e^{tX}\right) = e^{\lambda\left(e^t - 1\right)}$$

$$\mathrm{E}(X) = \lambda$$

$$\mathrm{Var}(X) = \lambda$$

Page 30:

Normal Distribution

Confidence intervals for recombinant fraction can be estimated using the Normal distribution.

$$X \sim \mathrm{N}(\mu, \sigma^2)$$

Page 31:

Properties of Normal Distribution

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

$$\mathrm{mgf}(X) = \mathrm{E}\left(e^{tX}\right) = e^{\mu t + \frac{\sigma^2 t^2}{2}}$$

$$\mathrm{E}(X) = \mu$$

$$\mathrm{Var}(X) = \sigma^2$$

Page 32:

Chi-Square Distribution

Many hypothesis tests in statistical genetics use the chi-square distribution.

$$f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}, \quad x > 0$$

$$\mathrm{mgf}(X) = (1 - 2t)^{-k/2} \quad \text{for } t < 0.5$$

$$\mathrm{E}(X) = k$$

$$\mathrm{Var}(X) = 2k$$

where k is the degrees of freedom.

Page 33:

Likelihood I

Likelihoods are used frequently in genetic data because they handle the complexities of genetic models well.

Let θ be a parameter or vector of parameters that affect the random variable X, e.g. θ = (μ, σ) for the normal distribution.

Page 34:

Likelihood II

Then, we can write a likelihood

$$L(\theta) = L(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \theta),$$

where we have observed an independent sample of size n, namely x1, x2, …, xn, and conditioned on the parameter θ. Normally, θ is not known to us. To find the θ that best fits the data, we maximize L(θ) over all θ.

Page 35:

Example: Likelihood of Binomial

$$L(\theta) = L(p \mid n, x) = P(X = x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$$

$$l = \log L = \log \binom{n}{x} + x \log p + (n - x) \log(1 - p)$$

$$\frac{\partial l}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p}$$

Setting the derivative to zero gives the maximum likelihood estimate $\hat{p} = \frac{x}{n}$.
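The closed-form estimate p̂ = x/n can be checked by maximizing the log-likelihood over a grid; a sketch with illustrative data n = 100, x = 37:

```python
from math import comb, log

def loglik(p, n, x):
    # log L = log C(n, x) + x log p + (n - x) log(1 - p)
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)

n, x = 100, 37
grid = [i / 1000 for i in range(1, 1000)]            # p on a grid in (0, 1)
p_hat = max(grid, key=lambda p: loglik(p, n, x))     # grid maximizer
print(p_hat, x / n)  # 0.37 0.37
```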

Page 36:

The Score

Definition: The first derivative of the log likelihood with respect to the parameter is the score.

For example, the score for the binomial parameter p is

$$\frac{\partial l}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p}$$

Page 37:

Information Content

Definition: The information content is

$$I(\theta) = \mathrm{E}\left[-\frac{\partial^2 l(\theta \mid x)}{\partial \theta^2}\right] = \mathrm{E}\left[\left(\frac{\partial l(\theta \mid x)}{\partial \theta}\right)^2\right]$$

If evaluated at the maximum likelihood estimate $\hat{\theta}$, then it is called expected information.
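For the binomial, differentiating the score gives expected information I(p) = n / (p(1 − p)). A sketch comparing that formula with a numerical second derivative of the log-likelihood, evaluated at the expected count x = np (values are illustrative):

```python
from math import log

n, p = 100, 0.25
x = n * p  # evaluate at x = E[X] so observed and expected information agree

# Expected information for Bin(n, p): E[-d^2 l / dp^2] = n / (p (1 - p))
info = n / (p * (1 - p))

# Numerical second derivative of l(q) = x log q + (n - x) log(1 - q) at q = p
l = lambda q: x * log(q) + (n - x) * log(1 - q)
h = 1e-5
d2 = (l(p + h) - 2 * l(p) + l(p - h)) / h**2
print(info, -d2)  # both close to 533.33
```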

Page 38:

Hypothesis Testing

Most experiments begin with a hypothesis. This hypothesis must be converted into a statistical hypothesis.

Statistical hypotheses consist of a null hypothesis H0 and an alternative hypothesis HA.

Statistics are used to reject H0 and accept HA. Sometimes we cannot reject H0, and we accept it instead.

Page 39:

Rejection Region I

Definition: Given a cumulative probability distribution function for the test statistic X, F(X), the critical region for a hypothesis test is the region of rejection, the area under the probability distribution where the observed test statistic X is unlikely to fall if H0 is true.

The rejection region may or may not be symmetric.

Page 40:

Rejection Region II

[Figure: distribution of the test statistic under H0, with rejection-region tail areas marked 1 − F(x_c), and 1 − F(x_l) or 1 − F(x_u).]

Page 41:

Acceptance Region

Region where H0 cannot be rejected.

Page 42:

One-Tailed vs. Two-Tailed

Use a one-tailed test when the H0 is unidirectional, e.g. H0: θ ≤ 0.5.

Use a two-tailed test when the H0 is bidirectional, e.g. H0: θ = 0.5.

Page 43:

Critical Values

Definition: Critical values are those values corresponding to the cut-off point between rejection and acceptance regions.

Page 44:

P-Value

Definition: The p-value is the probability of observing a sample outcome at least as extreme as the one observed, assuming H0 is true:

$$\text{p-value} = 1 - F(\hat{x}).$$

Reject H0 when the p-value ≤ α. The significance level of the test is α.

Page 45:

Chi-Square Test: Goodness-of-Fit

Calculate the expected counts e_i under H0. Then

$$\chi^2 = \sum_{i=1}^{a} \frac{(o_i - e_i)^2}{e_i}$$

χ² is distributed as chi-square with a − 1 degrees of freedom. When the expected values depend on k unknown parameters, then df = a − 1 − k.
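A worked sketch with hypothetical F2 phenotype counts (152 smooth, 48 wrinkled out of 200), tested against the 3:1 ratio expected under dominance:

```python
# Observed counts and expected counts under H0: 3/4 smooth, 1/4 wrinkled.
obs = [152, 48]
n = sum(obs)
exp_counts = [0.75 * n, 0.25 * n]  # e_i = n * p_i under H0

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp_counts))
print(round(chi2, 4))  # 0.1067 with a - 1 = 1 df
```

The statistic is well below the 5% critical value of 3.84 for 1 df, so H0 is not rejected for these counts.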

Page 46:

Chi-Square Test: Test of Independence

$$e_{ij} = n\, p_{0i}\, p_{0j}$$

$$\chi^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

degrees of freedom = (a − 1)(b − 1)

Example: test for linkage

Page 47:

Likelihood Ratio Test

$$\mathrm{LR} = \frac{L(\hat{\theta} \mid X)}{L(\theta_0 \mid X)}$$

G = 2 log(LR); G ~ χ² with degrees of freedom equal to the difference in the number of parameters.

Page 48:

LR: goodness-of-fit & independence test

goodness-of-fit:

$$G = 2 \sum_{i=1}^{a} o_i \log\left(\frac{o_i}{e_i}\right)$$

independence test:

$$G = 2 \sum_{i=1}^{a} \sum_{j=1}^{b} o_{ij} \log\left(\frac{o_{ij}}{e_{ij}}\right)$$
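A sketch of the goodness-of-fit G statistic, using hypothetical counts (152 smooth, 48 wrinkled, with 3:1 expectation):

```python
from math import log

obs = [152, 48]
exp_counts = [150.0, 50.0]  # e_i under H0: 3/4 smooth, 1/4 wrinkled

G = 2 * sum(o * log(o / e) for o, e in zip(obs, exp_counts))
print(round(G, 4))  # close to the Pearson chi-square of 0.1067 for the same data
```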

Page 49:

Compare 2 and Likelihood Ratio

Both give similar results. LR is more powerful when there are unknown parameters involved.

Page 50:

LOD Score

LOD stands for log of odds. It is commonly denoted by Z.

$$Z = \log_{10}\left[\frac{L(\hat{\theta} \mid X)}{L(\theta_0 \mid X)}\right]$$

The interpretation is that HA is $10^Z$ times more likely than H0. The p-values obtained by the LR statistic for LOD score Z are approximately $10^{-Z}$.

Page 51:

Nonparametric Hypothesis Testing

What do you do when the test statistic does not follow some standard probability distribution?

Use an empirical distribution. Assume H0 and resample (bootstrap, jackknife, or permutation) to generate:

$$\text{empirical CDF: } \hat{F}(x) = P(X \leq x)$$
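A minimal bootstrap sketch of this idea (all names and the toy statistic are my own; in practice the resampling scheme must respect H0):

```python
import random

random.seed(1)

# Toy data, treated as if generated under H0.
data = [random.gauss(0.0, 1.0) for _ in range(50)]

def stat(sample):
    # The test statistic of interest; here simply the sample mean.
    return sum(sample) / len(sample)

# Resample with replacement to build the empirical distribution of the statistic.
boot = sorted(stat(random.choices(data, k=len(data))) for _ in range(1000))

def empirical_cdf(x):
    # empirical CDF(x) = fraction of bootstrap statistics <= x
    return sum(b <= x for b in boot) / len(boot)

# Empirical one-sided p-value for the observed statistic.
p_value = 1 - empirical_cdf(stat(data))
```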