Lecture 3: Statistics Review I
Date: 9/3/02
Topics: distributions, likelihood, hypothesis tests
Sources of Variation
Definition: Sampling variation results because we only sample a fraction of the full population (e.g. the mapping population).
Definition: There is often substantial experimental error in the laboratory procedures used to make measurements. Sometimes this error is systematic.
Parameters vs. Estimates
Definition: The population is the complete collection of all individuals or things you wish to make inferences about. Statistics calculated on populations are parameters.
Definition: The sample is a subset of the population on which you make measurements. Statistics calculated on samples are estimates.
Types of Data
Definition: Usually the data is discrete, meaning it can take on one of countably many different values.
Definition: Many complex and economically valuable traits are continuous. Such traits are quantitative, and the random variables associated with them are continuous (hence quantitative trait loci, QTL).
Random
We are concerned with the outcomes of random experiments, e.g.:
- production of gametes
- union of gametes (fertilization)
- formation of chiasmata and recombination events
Set Theory I
Set theory underlies probability.
Definition: A set is a collection of objects.
Definition: An element is an object in a set. Notation: s ∈ S, "s is an element in S".
Definition: If A and B are sets, then A is a subset of B if and only if s ∈ A implies s ∈ B. Notation: A ⊆ B, "A is a subset of B".
Definition: Two sets A and B are equal if and only if A ⊆ B and B ⊆ A. We write A = B.
Definition: The universal set is the superset of all other sets, i.e. all other sets are included within it. Often represented as Ω.
Definition: The empty set contains no elements and is denoted as ∅.
Set Theory II
Sample Space & Event
Definition: The sample space for a random experiment is the set that includes all possible outcomes of the experiment.
Definition: An event is a set of possible outcomes of the experiment. An event E is said to happen if any one of the outcomes in E occurs.
Example: Mendel I
Mendel took inbred lines of smooth AA and wrinkled BB peas and crossed them to make the F1 generation and again to make the F2 generation. Smooth A is dominant to B.
The random experiment is the random production of gametes and fertilization to produce peas.
The sample space of genotypes for the F2 is {AA, AB, BB}.
Random Variable
Definition: A function from set S to set T is a rule assigning to each sS, an element tT.
Definition: Given a random experiment on sample space Ω, a function from Ω to T is a random variable. We often write X, Y, or Z. If we were very careful, we’d write X(s).
Simply, X is a measurement of interest on the outcome of a random experiment.
Example: Mendel II
Let X be the number of A alleles in a randomly chosen genotype. X is a random variable.
The sample space of X is {0, 1, 2}.
Discrete Probability Distribution
Suppose X is a random variable with possible outcomes {x1, x2, …, xm}. Define the discrete probability distribution for random variable X as
$$p_X(x_i) = \mathrm{P}(X = x_i), \qquad p_X(x) = 0 \text{ otherwise},$$
with
$$\sum_i p_X(x_i) = 1.$$
Example: Mendel III
$$p_X(0) = 0.25, \qquad p_X(1) = 0.50, \qquad p_X(2) = 0.25, \qquad p_X(x) = 0 \text{ otherwise.}$$
Cumulative Distribution
The discrete cumulative distribution function is defined as
$$F_X(x) = \mathrm{P}(X \le x) = \sum_{x_i \le x} \mathrm{P}(X = x_i).$$
The continuous cumulative distribution function is defined as
$$F_X(x) = \mathrm{P}(X \le x) = \int_{-\infty}^{x} f(u)\,du.$$
Continuous Probability Distribution
If $F'(x) = \frac{d}{dx}F(x) = f(x)$ exists, then $f(x)$ is the continuous probability distribution. As in the discrete case,
$$\int_{-\infty}^{\infty} f(u)\,du = 1.$$
Expectation and Variance
$$\mathrm{E}(X) = \begin{cases} \sum_i x_i\, p_X(x_i) & \text{for a discrete random variable} \\ \int u f(u)\,du & \text{for a continuous random variable} \end{cases}$$
$$\mathrm{Var}(X) = \mathrm{E}(X - \mathrm{E}X)^2 = \begin{cases} \sum_i (x_i - \mathrm{E}X)^2\, p_X(x_i) & \text{for a discrete random variable} \\ \int (u - \mathrm{E}X)^2 f(u)\,du & \text{for a continuous random variable} \end{cases}$$
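As a quick check of these formulas, the expectation and variance of the Mendel example's X (the number of A alleles) can be computed directly from its distribution. A minimal sketch in Python, using the pmf from the Mendel III example:

```python
# pmf of X = number of A alleles in an F2 genotype (from the Mendel example):
# P(X=0) = 0.25, P(X=1) = 0.50, P(X=2) = 0.25
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

mean = sum(x * p for x, p in pmf.items())                # E(X) = sum_i x_i p(x_i)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # Var(X) = E(X - EX)^2

print(mean, var)  # E(X) = 1.0, Var(X) = 0.5
```

Both values match the hand calculation: E(X) = 1 and Var(X) = 1/2.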
Moments and MGF
Definition: The rth moment of X is $\mathrm{E}(X^r)$. Definition: The moment generating function is defined as $\mathrm{E}(e^{tX})$.
$$\mathrm{mgf}(X) = \mathrm{E}(e^{tX}) = \begin{cases} \sum_i e^{t x_i}\, p_X(x_i) & \text{for a discrete random variable} \\ \int e^{tu} f(u)\,du & \text{for a continuous random variable} \end{cases}$$
Example: Mendel IV
Define the random variable Z as follows:
If we hypothesize that smooth dominates wrinkled in a single-locus model, then the corresponding probability model is given by:
$$Z = \begin{cases} 0 & \text{if seed is smooth} \\ 1 & \text{if seed is wrinkled} \end{cases}$$
Example: Mendel V
$$p_Z(0) = \mathrm{P}(Z = 0) = \tfrac{3}{4}, \qquad p_Z(1) = \mathrm{P}(Z = 1) = \tfrac{1}{4}$$
$$\mathrm{E}(Z) = 0 \cdot \tfrac{3}{4} + 1 \cdot \tfrac{1}{4} = \tfrac{1}{4}$$
$$\mathrm{Var}(Z) = \left(0 - \tfrac{1}{4}\right)^2 \tfrac{3}{4} + \left(1 - \tfrac{1}{4}\right)^2 \tfrac{1}{4} = \tfrac{3}{16}$$
Joint and Marginal Cumulative Distributions
Definition: Let X and Y be two random variables. Then the joint cumulative distribution is
Definition: The marginal cumulative distribution is
$$F(x, y) = \mathrm{P}(X \le x, Y \le y)$$
$$F_X(x) = \mathrm{P}(X \le x, Y < \infty) = \begin{cases} \sum_{x_i \le x} \sum_j \mathrm{P}(X = x_i, Y = y_j) & \text{for discrete random variables} \\ \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, v)\,dv\,du & \text{for continuous random variables} \end{cases}$$
Joint Distribution
Definition: The joint distribution is
As before, the sum or integral over the sample space sums to 1.
$$p(x, y) = \mathrm{P}(X = x, Y = y)$$
Conditional Distribution
Definition: The conditional distribution of X given that Y=y is
Lemma: If X and Y are independent, then p(x|y)=p(x), p(y|x)=p(y), and p(x,y)=p(x)p(y).
$$p(x \mid y) = \mathrm{P}(X = x \mid Y = y) = \frac{\mathrm{P}(X = x, Y = y)}{\mathrm{P}(Y = y)} = \frac{p(x, y)}{p(y)}$$
Example: Mendel VI
$$\mathrm{P}(\text{homozygous} \mid \text{smooth seed}) = \frac{\mathrm{P}(X = 2, Z = 0)}{\mathrm{P}(Z = 0)} = \frac{1/4}{3/4} = \frac{1}{3}$$
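This conditional probability can be checked by simulation. The sketch below draws F2 genotypes by sampling one allele from each heterozygous F1 parent (alleles A and B as in the Mendel example) and estimates P(homozygous | smooth); the sample size and seed are arbitrary choices:

```python
import random
random.seed(1)

# Monte Carlo check of P(homozygous | smooth) = 1/3 in Mendel's F2.
# Each F2 pea receives one allele from each F1 (AB) parent; A (smooth) is dominant.
n = 100_000
smooth = homo_smooth = 0
for _ in range(n):
    genotype = random.choice("AB") + random.choice("AB")
    if "A" in genotype:          # smooth phenotype (at least one A)
        smooth += 1
        if genotype == "AA":     # homozygous smooth
            homo_smooth += 1

est = homo_smooth / smooth
print(est)  # close to 1/3
```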
Binomial Distribution
Suppose there is a random experiment with two possible outcomes, which we call “success” and “failure”. Suppose there is a constant probability p of success for each experiment and multiple experiments of this type are independent. Let X be the random variable that counts the total number of successes in n experiments. Then $X \sim \mathrm{Bin}(n, p)$.
Properties of Binomial Distribution
$$f(x; n, p) = \mathrm{P}(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n$$
$$\mathrm{E}(X) = np$$
$$\mathrm{Var}(X) = np(1 - p)$$
$$\mathrm{mgf}(X) = \mathrm{E}(e^{tX}) = \left(p e^t + 1 - p\right)^n$$
Examples: Binomial Distribution
recombinant fraction between two loci: count the number of recombinant gametes in n sampled.
phenotype in Mendel’s F2 cross: count the number of smooth peas in F2.
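A minimal sketch of the binomial pmf applied to the recombinant-gamete example; the values n = 20 and r = 0.1 are hypothetical:

```python
from math import comb

# Binomial pmf f(x; n, p) = C(n, x) p^x (1-p)^(n-x), here with p = r,
# the (hypothetical) recombinant fraction, and n sampled gametes.
def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, r = 20, 0.1
total = sum(binom_pmf(x, n, r) for x in range(n + 1))   # should be 1
mean = sum(x * binom_pmf(x, n, r) for x in range(n + 1))  # should be n*r

print(total, mean)  # total ≈ 1, mean ≈ 2
```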
Multinomial Distribution
Suppose you consider genotype in Mendel’s F2 cross, or a 3-point cross.
Definition: Suppose there are m possible outcomes and the random variables X1, X2, …, Xm count the number of times each outcome is observed. Then,
$$\mathrm{P}(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = \frac{n!}{x_1!\, x_2! \cdots x_m!}\, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}$$
where $x_1 + x_2 + \cdots + x_m = n$. We write $(X_1, \ldots, X_m) \sim \mathrm{M}(n; p_1, p_2, \ldots, p_m)$.
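The multinomial probability can be evaluated directly. The sketch below uses hypothetical F2 genotype counts (1, 2, 1) for (AA, AB, BB) with probabilities (1/4, 1/2, 1/4):

```python
from math import factorial

# Multinomial probability: n!/(x1!...xm!) * p1^x1 ... pm^xm
def multinom_p(counts, probs):
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)        # multinomial coefficient n!/(x1!...xm!)
    p = 1.0
    for x, q in zip(counts, probs):
        p *= q ** x
    return coef * p

# Hypothetical counts for (AA, AB, BB) in four F2 peas
p = multinom_p([1, 2, 1], [0.25, 0.5, 0.25])
print(p)  # 0.1875
```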
Poisson Distribution
Consider the Binomial distribution when p is small and n is large, but $np = \lambda$ is constant. Then,
The distribution obtained is the Poisson Distribution.
$$\lim_{n \to \infty} \binom{n}{x} p^x (1 - p)^{n - x} = \frac{e^{-\lambda} \lambda^x}{x!}$$
Properties of Poisson Distribution
$$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}$$
$$\mathrm{E}(X) = \lambda$$
$$\mathrm{Var}(X) = \lambda$$
$$\mathrm{mgf}(X) = \mathrm{E}(e^{tX}) = e^{\lambda (e^t - 1)}$$
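The limiting relationship to the binomial can be checked numerically; the choices n = 1000, λ = 2, x = 3 below are arbitrary:

```python
from math import comb, exp, factorial

# Poisson as a limit of the binomial: n large, p small, lambda = n*p fixed.
n, lam = 1000, 2.0
p = lam / n
x = 3

binom = comb(n, x) * p**x * (1 - p)**(n - x)   # exact binomial probability
pois = exp(-lam) * lam**x / factorial(x)       # Poisson approximation

print(binom, pois)  # the two values are close
```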
Normal Distribution
Confidence intervals for recombinant fraction can be estimated using the Normal distribution.
We write $X \sim \mathrm{N}(\mu, \sigma^2)$.
Properties of Normal Distribution
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
$$\mathrm{E}(X) = \mu$$
$$\mathrm{Var}(X) = \sigma^2$$
$$\mathrm{mgf}(X) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$$
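A sketch of the normal-approximation confidence interval mentioned above for a recombinant fraction, using the binomial mean and variance; the counts x = 15 recombinants out of n = 100 gametes are hypothetical:

```python
from math import sqrt

# Normal-approximation 95% CI for a recombinant fraction r_hat = x/n,
# with SE(r_hat) ~ sqrt(r_hat * (1 - r_hat) / n) and z = 1.96.
x, n = 15, 100                     # hypothetical counts
r_hat = x / n
se = sqrt(r_hat * (1 - r_hat) / n)
lo, hi = r_hat - 1.96 * se, r_hat + 1.96 * se

print(r_hat, lo, hi)  # roughly 0.15 (0.08, 0.22)
```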
Chi-Square Distribution
Many hypotheses tests in statistical genetics use the chi-square distribution.
$$f(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}$$
$$\mathrm{E}(X) = k$$
$$\mathrm{Var}(X) = 2k$$
$$\mathrm{mgf}(X) = (1 - 2t)^{-k/2} \quad \text{for } t < 0.5$$
Likelihood I
Likelihoods are used frequently with genetic data because they handle the complexities of genetic models well.
Let θ be a parameter or vector of parameters that affect the random variable X, e.g. θ = (μ, σ²) for the normal distribution.
Likelihood II
Then, we can write a likelihood
$$L(\theta) = L(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \mathrm{P}(x_i \mid \theta)$$
where we have observed an independent sample of size n, namely x1, x2, …, xn, conditioned on the parameter θ. Normally, θ is not known to us. To find the θ that best fits the data, we maximize L(θ) over all θ.
Example: Likelihood of Binomial
$$L(p) = L(p \mid x) = \mathrm{P}(X = x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$$
$$l = \log L = \log \binom{n}{x} + x \log p + (n - x) \log(1 - p)$$
$$\frac{\partial l}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p} = 0 \quad \Rightarrow \quad \hat{p} = \frac{x}{n}$$
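The closed-form MLE $\hat{p} = x/n$ can be verified by maximizing the log-likelihood numerically over a grid; the counts x = 30 successes in n = 100 trials are hypothetical:

```python
from math import comb, log

# Binomial log-likelihood l(p) = log C(n,x) + x log p + (n-x) log(1-p)
def loglik(p, x, n):
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)

x, n = 30, 100                       # hypothetical data
p_hat = x / n                        # closed-form MLE

# Grid search over p in (0, 1); the maximizer should land at x/n = 0.3
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: loglik(p, x, n))

print(p_hat, best)  # both 0.3
```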
The Score
Definition: The first derivative of the log likelihood with respect to the parameter θ is the score.
For example, the score for the binomial parameter p is
$$\frac{\partial l}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p}$$
Information Content
Definition: The information content is
$$I(\theta) = \mathrm{E}\left[\left(\frac{\partial l(\theta \mid x)}{\partial \theta}\right)^{2}\right] = -\,\mathrm{E}\left[\frac{\partial^{2} l(\theta \mid x)}{\partial \theta^{2}}\right]$$
If evaluated at the maximum likelihood estimate $\hat{\theta}$, then it is called the expected information.
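For the binomial example, substituting $\mathrm{E}(x) = np$ into $-\partial^2 l/\partial p^2 = x/p^2 + (n-x)/(1-p)^2$ gives the expected information $I(p) = n/(p(1-p))$, whose inverse approximates the variance of $\hat{p}$. A small numeric sketch (n and p are arbitrary):

```python
# Expected information for the binomial: I(p) = n / (p (1 - p)),
# obtained by taking E(x) = np in -d^2 l / dp^2 = x/p^2 + (n-x)/(1-p)^2.
# Its inverse approximates Var(p_hat) = p (1 - p) / n.
n, p = 100, 0.3                    # arbitrary illustration values
info = n / (p * (1 - p))
var_p_hat = 1 / info

print(info, var_p_hat)  # Var(p_hat) = 0.3 * 0.7 / 100 = 0.0021
```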
Hypothesis Testing
Most experiments begin with a scientific hypothesis, which must be converted into a statistical hypothesis.
Statistical hypotheses consist of a null hypothesis H0 and an alternative hypothesis HA.
Statistics are used to reject H0 and accept HA. Sometimes we cannot reject H0, in which case we retain it instead.
Rejection Region I
Definition: Given a cumulative probability distribution function for the test statistic X, F(X), the critical region for a hypothesis test is the region of rejection, the area under the probability distribution where the observed test statistic X is unlikely to fall if H0 is true.
The rejection region may or may not be symmetric.
Rejection Region II
(Figure: distribution of the test statistic under H0, showing the rejection region: tail area 1 − F(x_c) for a one-tailed test, or the areas beyond the lower and upper critical values x_l and x_u for a two-tailed test.)
Acceptance Region
Region where H0 cannot be rejected.
One-Tailed vs. Two-Tailed
Use a one-tailed test when H0 is unidirectional, e.g. H0: θ ≤ 0.5.
Use a two-tailed test when H0 is bidirectional, e.g. H0: θ = 0.5.
Critical Values
Definition: Critical values are those values corresponding to the cut-off point between rejection and acceptance regions.
P-Value
Definition: The p-value is the probability, assuming H0 is true, of observing a sample outcome at least as extreme as the one actually observed.
Reject H0 when the p-value ≤ α. The significance level of the test is α.
$$\text{p-value} = 1 - F(\hat{x})$$
Chi-Square Test: Goodness-of-Fit
Calculate the expected counts $e_i$ under H0.
$$\chi^2 = \sum_{i=1}^{a} \frac{(o_i - e_i)^2}{e_i}$$
$\chi^2$ is distributed as chi-square with $a - 1$ degrees of freedom, where $a$ is the number of categories. When the expected values depend on $k$ unknown parameters, then $\mathrm{df} = a - 1 - k$.
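A worked sketch of the goodness-of-fit statistic, testing a 3:1 smooth:wrinkled ratio; the observed counts are hypothetical:

```python
# Chi-square goodness-of-fit for a 3:1 smooth:wrinkled ratio (hypothetical counts).
obs = [70, 30]                        # o_i: smooth, wrinkled
n = sum(obs)
exp_counts = [0.75 * n, 0.25 * n]     # e_i under H0 (3:1 ratio)

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp_counts))
print(chi2)  # compare to the chi-square critical value 3.84 (df = 1, alpha = 0.05)
```

Here $\chi^2 = 25/75 + 25/25 \approx 1.33 < 3.84$, so H0 would not be rejected at the 5% level.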
Chi-Square Test: Test of Independence
$$e_{ij} = n\, p_{0i}\, p_{0j}$$
$$\chi^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
degrees of freedom = (a − 1)(b − 1). Example: test for linkage.
Likelihood Ratio Test
$$\mathrm{LR} = \frac{L(\hat{\theta} \mid X)}{L(\theta_0 \mid X)}$$
$$G = 2 \log(\mathrm{LR})$$
$G \sim \chi^2$ with degrees of freedom equal to the difference in number of parameters.
LR: goodness-of-fit & independence test
goodness-of-fit:
$$G = 2 \sum_{i=1}^{a} o_i \log\frac{o_i}{e_i}$$
independence test:
$$G = 2 \sum_{i=1}^{a} \sum_{j=1}^{b} o_{ij} \log\frac{o_{ij}}{e_{ij}}$$
Compare 2 and Likelihood Ratio
Both give similar results. LR is more powerful when there are unknown parameters involved.
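On the same hypothetical 3:1 counts used earlier, the Pearson $\chi^2$ and likelihood-ratio $G$ statistics can be compared directly:

```python
from math import log

# Pearson chi-square vs. the likelihood-ratio statistic
# G = 2 * sum_i o_i * log(o_i / e_i) on the same (hypothetical) counts.
obs = [70, 30]
n = sum(obs)
exp_counts = [0.75 * n, 0.25 * n]     # expected under a 3:1 ratio

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp_counts))
G = 2 * sum(o * log(o / e) for o, e in zip(obs, exp_counts))

print(chi2, G)  # similar values, as the slide states
```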
LOD Score
LOD stands for log of odds. It is commonly denoted by Z.
The interpretation is that HA is $10^Z$ times more likely than H0. The p-values obtained by the LR statistic for LOD score Z are approximately $10^{-Z}$.
$$Z = \log_{10} \frac{L(\hat{\theta} \mid X)}{L(\theta_0 \mid X)}$$
Nonparametric Hypothesis Testing
What do you do when the test statistic does not follow some standard probability distribution?
Use an empirical distribution. Assume H0 and resample (bootstrap, jackknife, or permutation) to generate:
$$\text{empirical CDF: } \hat{F}(x) = \mathrm{P}(X \le x)$$
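A permutation-test sketch of this idea: resample under H0 to build the empirical null distribution of a test statistic and read a p-value from its empirical CDF. The data, group sizes, and seed below are all hypothetical:

```python
import random
random.seed(2)

# Permutation sketch: difference in group means as the test statistic,
# resampled under H0 (group labels are exchangeable).
a = [2.1, 2.5, 1.9, 2.8, 2.4]        # hypothetical measurements, group A
b = [1.8, 2.0, 1.7, 2.2, 1.9]        # hypothetical measurements, group B
observed = sum(a) / len(a) - sum(b) / len(b)

pooled = a + b
null = []
for _ in range(10_000):
    random.shuffle(pooled)                           # relabel under H0
    null.append(sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5)

# One-sided p-value from the empirical null distribution
# (small tolerance guards floating-point ties at the boundary)
p_value = sum(d >= observed - 1e-9 for d in null) / len(null)
print(observed, p_value)
```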