Upload
others
View
14
Download
0
Embed Size (px)
Citation preview
1
Lecture 12:
Analysis of Selection
Experiments
Bruce Walsh lecture notes
Synbreed course
version 4 July 2013
2
Variance in the Response to Selection
R = h2S is just the expected value of
the response, but there is a variance
about this value.
Hence, identically-selected replicate lines are
Still expected to show variation is response
The major source of such variation is genetic
drift
3
Consider the mean in generation t
zt = µ + gt + dt + et
µ = Mean of the original population
gt = The mean breeding value in generation t
dt = the effect of any major environmental trend in generation t
et = error in estimating the environmental-corrected mean breeding value
from the mean phenotype of a sample
Under this model, the mean of a replicate series of
lines is
E(zt ) = µ + E(gt ) + dt R = h2S
4
Consider the mean in generation t
E(zt ) = µ + E(gt ) + dt
The variance is given by
σ2z(t) = σ2
g(t) + σ2e(t) + σ2
d
Variance in the environmental trend (mean is set to be
zero)
Variance in the breeding value at generation t
In generation t, Mt individuals are measured. An upper bound
on the error variance is Var(z)/Mt
σ2e(t) =
σ2z
Mt
5
Variance in Breeding ValuesTwo sources of variation
(i) Sampling variance in the founding lines
(ii) Genetic drift (inbreeding) within each line
M0 = Size of the founding population
ft = Inbreeding in
generation t
σ2g(t) =
(1
M0+ 2ft
)σ2
A =(
1M0
+ 2ft
)h2σ2
z
t)([2ft = 2 1− 1− 1
2Ne
]" t/Ne for t/Ne << 1
The mean breeding values in different generations of the same
replicate line are correlated,
( )0
σ(gt , gt!) = σ ( zt, zt! ) =1
M+ 2ft h2σ2
z for t < t!
6
Variance-covariance
structure within a lineAssume the initial sample is sufficiently large so
that we can ignore 1/M0
Variance: !2(gt) = (t/Ne)h2!2z
Covariance: !2(gt, gx) = (x/Ne)h2!2z for x < t
These expressions (which are often called the
pure-drift approximations) will prove useful in
the statistical analysis of selection response
7
The Realized Heritability
Since R = h2 S, this suggests h2 = R/S, so that
the ratio of the observed response over the
observed differential provides an estimate of
the heritability, the realized heritability
Obvious definition for a single generation of response.
What about for multiple generations of response?
Cumulative selection response = sum of all responses
RC(t =t
i=1
R(i))∑
8
Cumulative selection differential = sum of the S’s
SC(t) =t
=1
S(i)∑
i
(1) The Ratio Estimator for realized heritability
= total response/total differential,
(T )Rh2
r = C
SC(T )(2) The Regression Estimator --- the slope of the
regression of cumulative response on cumulative differential
RC(t) = h2r SC(t) + et
∑b =∑
t SC(t)RC(t)
t SC(t)2Regression passes through the
origin (R = 0 when S = 0). Slope =
9
Example: Divergent selection on abdominal bristle
number in Drosophila (data from T. MacKay)
t z z S(t) R(t) SC (t) RC (t)
1 18.02 20.10 20.10−18.02 = 2.08 18.34−18.02 = 0.32 .08 0.322 18.34 21.00 21.00−18.34 = 2.66 19.05−18.34 = 0.71 .74 1.033 19.05 21.75 21.75−19.05 = 2.70 20.07−19.05 = 1.02 .44 2.054 20.07 22.55 22.55−20.07 = 2.48 20.36−20.07 = 0.29 .92 2.345 20.36 22.95 22.95−20.36 = 2.59 20.65−20.36 = 0.29 12.51 2.636 20.65
*
Ratio estimate: h2 = 2.63/12.51 = 0.2102
Regression (OLS) estimator
h 2r = bC(OLS) =
∑5i=1 SC(i) · RC(i)∑5
i=1 S 2C(i)
=78.96350.45
= 0.2245
10
60\0
Cumulative Differential
0
5
10
15
20C
um
ula
tiv
e R
esp
on
se
Note x axis is differential,
NOT generations
Ratio estimator
= 2.63/12.51 = 0.210
Slope = 0.2245
= Regression
estimator
11
Standard error of the Ratio Estimator
Ratio Estimator, h2r = RT/ST
Recall that the variance for the mean in generation
t is !2(gt) + !2(e) + !2(d)
Assume M0 >> 1 and that we can ignore the
environmental trend variance, then
!2(RT) = !2(gt) + !2(e) = (T/Ne)h2!2z + !2
z /MT
!2(RT/ST ) = !2(RT ) /(ST )2
This follows since !2(ax) = a2!2 (x)
Hence, !2(h2r) = [ (T/Ne)h2!2
z + !2z /MT ] /(ST )2
12
SE for (OLS) Regression
Estimator
RC(t) = h2r SC(t) + et
The basic linear model is
Under the OLS framework (residuals homoscedastic
and uncorrelated), the linear model has the design
matrix X just the vector Sc of cumulative differential
and y = R
X = Sc
" = bc y = R
"bC(OLS) =
(XT X
) 1XT y =
(ST S
)"1ST R
∑=
Ti=1 SC(i) · RC(i)
Ti=1 S 2
C(i)∑
13
SE for (OLS) Regression
Estimator
( ( ))- -
Var[bC(OLS)
]= σ2
e XTX1
= σ2e ST S
1
= σ2e
/ T∑
i=1
S 2C(i)
σ 2e =
1T − 1
T∑
i=1
e 2i =
1T − 1
T∑
i=1
(RC(i)− h 2
r SC(i))2
14
Problems with OLS regression approach
Although the OLS regression estimator for realized
heritability is very widely used, it has fatal problems
OLS assumes the residuals are homoscedastic and
uncorrelated. In reality, the covariance structure is
!2(ei) = (i/Ne)h2!2z + !2
z /Mi
!2(ek, ei) = (i/Ne)h2!2z for i < k
Hence, the GLS regression is more appropriate
The OLS gives unbiased estimates of the realized
heritability, but it seriously underestimates its SE
15
GLS regression Estimate
RC(t) = h2r SC(t) + et
X = Sc
" = bc y = R
The variance-covariance matrix V has elements
Vii = (i/Ne) h2 !2z + !2
z /Mi
Vji = Vij = (i/Ne) h2 !2z for i < j
We can directly estimate the phenotypic variance from the data
h2 is what we are trying to estimate. Use an iterative approach.
Try some initial value, use GLS to update this value, use the new
value for next round of updating. Continue until values stabilize.
- --bC(GLS) = (ST V 1S) 1ST V 1R
16
Example: GLS estimator for MacKay’s data
Here M = 100, N = 20, while Var(z) = 3.293 Building up the covariance matrix V
V = 0.1647·
h2 + 0.2 h2 h2 h2 h2
h2 2h2 + 0.2 2h2 2h2 2h2
h2 2h2 3h2 + 0.2 3h2 3h2
h2 2h2 3h2 4h2 + 0.2 4h2
h2 2h2 3h2 4h2 5h2 + 0.2
σ2 [RC(i) ] =(
i
N
)h2σ2
z +σ2
z
M= i · h2 · 0.1647 + 0.03292* *
σ [RC(i), RC(j) ] =(
i
N
)h2σ2
z = i · h2 · 0.1647 for i < j* *
Start with some initial value for h2, update V, obtain new
estimate. Repeat until convergence. Start with h2 = 0.21
h2r = bC(GLS)(1) = (STV"1S)"1STV"1 = 0.222197
Using this update, next estimate is 0.222135, which remains unchanged
17
Standard errors of the three estimates
For the OLS regression,
Var[bC(O S)
]L = σ2
e
/ T∑
i=1
S 2C(i)
σ 2e =
1T − 1
T∑
i=1
e 2i =
1T − 1
T∑
i=1
(RC(i)− h 2
r SC(i))2
= 0.091/4 = 0.0228
= 0.0228/350.45 = 0.0000649
Hence, SE for OLS estimator is 0.0081
!2(h2r) = [ (T/Ne)h2!2
z + !2z /MT ] /(ST )2
= [(5/20)*0.21*3.292 + 0.03292]/12.512 = 0.00132
For the ratio estimator,
Giving a SE of 0.0363
18
Standard errors of the three estimates
Finally, for the GLS regression,
(STV-1S)-1 = 1/790.4 = 0.001265
Hence, SE for GLS estimator is 0.0356
OLS: h2 = 0.2245 + 0.0081
Ratio: h2 = 0.2102 + 0.0363
GLS: h2 = 0.2221 + 0.0356
Much to small!
Summarizing:
19
Just how well does the
breeder’s equation work?
Sheridan (1988) compared realized heritability
estimates with estimates of heritability obtained
from resemblances between relatives in the
base populations
Punch-line: Good, but not great, fit in many settings
Problems with a wider meta-analysis is that standard
errors are often not presented nor is the data presented
in a form that allows their calculation.
20
157 (47%)8 (53%)Swine/Sheep
116 (55%)5 (45%)Poultry/Quail
3428 (82%)6 (18%)Mice/Rats
2619 (73%)7 (27%)Tribolium
6147 (77%)14 (23%)Drosophila
TotalNS differenceSignificant
Differences
Species
Comparison of realized and (relative-based)
Heritability estimates
21
Lots of ways for Breeder’s
equation to fail
• One: Inheritance model is wrong, andinfinitesimal model (large number ofloci, each of small effect) is incorrect
– However, even with just a single majorgene, BE is pretty good
• Two: Additional important featuresnot accounted by BE
22
23
24
Asymmetric Selection Response
Divergent Selection Experiment: Select some replicate
lines for increased trait value, others for decreased
value
Expectation: roughly
equal response in up and
down directions, R = h2S
Often an asymmetric response
is observed, with a significant
difference in the slope of up
vs. down-selection lines
Rc
Sc
25
Potential Causes: I. Design
Defects
• Different selection differentials (Plot is Rc vs. t,
not the correct plot of Rc vs. Sc)
• Drift (sample size not sufficiently large)
• Scale effects
• Undetected environmental trends
• Transient effects from previous selection
– Decay of epistatic response
• Undetected selection on correlated traits
26
Scale effects
When the trait biologically cannot go below
a specific value (i.e., 0), as we down-select towards
zero, expect less response.
Transform to a log
scale
27
Potential Causes: II. Nonlinear
Parent-Offspring regression
+S
+ h2S
Parent
Off
spri
ng
-S
- h2S
Linearity gives a symmetric response with
+S, -S
28
However, if PO regression is
nonlinear
+S
Parent
Off
spri
ng
-S
With this nonlinear regression, larger
absolute response for +S than for -S
29
Potential Causes: III.
Inbreeding depression
Change in mean due to
inbreeding depression.
True genetic response
in the absence of inbreeding
Depresses upward response,
Enhances downward response
30
Potential Causes: IV.
Genetic Asymmetry
• Requires changes in allele frequencies.
• The same absolute change in an allele frequency
can result in rather different changes in the
variance in the + vs. - change direction.
• This results in departures in the additive genetic
variance in up vs. down-selected lines, and hence
changes in h2 and response.
31
Additive variance, VA, with no dominance (k = 0)
Allele frequency, p
VA
When p = 1/2,
VarA(p+#) = VarA(p-#)
When p not 1/2,
VarA(p+#) not VarA(p+#)
p+#
p-#
p+#p-#
32
Additive variance, VA, with complete dominance (k = 1)
33
Additive variance, VA, with overdominance (k = 10)
34
Control Populations
zs,t = µ + gs,t + dt + es,t
zc,t = µ + gc,t + dt + ec,t
Until now, we have been ignoring the bias caused by
not accounting for any environmental trend.
One way to deal with this is to include an unselected
control population in the design
Hence,
E( zs,t − zc,t ) = E(gs,t )−E(gc,0 ) = h2SC(t)
35
E( zs,t − zc,t ) = h2SC(t) + (ds,t − dc,t)- -
−− −Rt = ( zs,t zc,t) ( zs,t"1 zc,t"1)
RC(t) = zs,t zc,t
St = z!s,t"1 − zs,t"1
-
Estimating trends with a control population
Complication 1: If G x E is present, then
Complication 2: Selection results in faster
inbreeding, so control must to comparatively inbred
to fully account for inbreeding depression
The use of a control also accounts for inbreeding depression
36
Divergent Selection Designs
zu,t = µ + gu,t + dt + eu,t
zd,t = µ + gd,t + dt + ed,t
Rt = ( zu,t− zu,t"1)− (zd,t− zd,t"1)RC(t) = zu,t − zd,t
St = ( z!u,t"1 − zu,t"1)− ( z!d,t"1− zd,t"1)
An alternative experimental design to remove
a common environmental trend is the divergent
selection design
Note that this design also accounts for inbreeding
depression (assuming up/down lines equally inbred)
Response estimated by
37
Variance in ResponseWe have been assuming that we can ignore !2
d.
With a control line and/or divergent selection, don’t have
to worry about this.
Control:
RC(t) = zs,t − zc,t = ( µ + gs,t + dt + es,t)− (µ + gc,t + dt + ec,t )= gs,t − gc,t + es,t − ec,t
The common dt term cancels
RC(t) = zsu,t− zsd,t = gsu,t − gsd,t + esu,t − esd,t
Divergent design
Again, common dt term cancels
38
1/Mu,t + 1/Md,t1/Nu + + 1/Ndfu,t + fd,tDivergent Selection,
no control
1/Ms,t + 1/Mc,t1/Ns + + 1/Ncfs,t + fc,tSelection in a single
direction, with control
1/Ms,t1/Nsfs,tSelection in a single
direction, no control
B (t > t’)AftDesign
σ2 [ RC(t) ] = (2ft +B0) 2ft h2σ2z +Bt σ2
z
" (t A + B0) h2σ2z + Bt σ2
z
σ [ RC(t), RC(t!) ] = (2ft + B0) h2σ2z
" (tA+ B0)σ2z h2 for t < t!
The resulting variance and covariances in response become
39
Variance with a Control
σ2(RC(t) ) =(
t
N+
1M0
)h2σ2
z +1M
σ2z −σ2
d
tσ2zh2
N> σ2
d
When does the use of a control population result
in a reduced variance?
Control populations are not without a cost.
Variance w/ control - variance without control =
Hence (ignoring M terms),
Regardless of the value of !2d, if sufficient generations are used, the optimal design (in
terms of giving the smallest expected variance in response) is not to use a control.
However, this approach runs the risk of an undetected directional environmental
trend compromising the estimated heritability.
40
Optimal Experimental Design
CV[ RC(t) ] =σ[RC(t) ]E[RC(t)]
(1/2Nt)1/2 /hi 2th2i!zDivergent Selection,
no control
(1/Nt)1/2/hi th2i!zSelection in one
direction, no control
(2/Nt)1/2/hi th2i!zSelection in one
direction, with control
CV [ R(t) ]E [ R(t) ]Design
The coefficient of variance (CV) provides one measure
for comparing different designs
CV scales with
Nt = total #
over the entire
experiment
41
ExampleSuppose we plan to select the upper 5% of the population
on a trait with h2 = 0.25
How large must N be to give a CV of 0.01 when no
control is used?
p = 0.05 --> i = 2.06. Assuming drift variance dominates
!2d, then
CV = 0.01 = (1/Nt)1/2/hi = (1/Nt)1/2/(0.5*2.06)
or Nt = 1/(0.01*0.5*2.06)2 = 9426
Hence, we must have at least 9,426 selected parents
over the course of the experiment
42
Moving from LS to MM and Bayes• Thus far, we have assumed a least-squares
(LS) analysis, which only uses informationfrom the generation means.
• When the pedigree is known, mixed-model(MM, e.g., the animal model) is much morepowerful, using all of the data and easilyhandling unbalanced design. MM is alikelihood analysis.
• Even better, a Bayes MM analysis fullyaccounts for model uncertainty (e.g., usingan estimated variance for BLUP).
43
Mixed-Model EstimationPROVIDED that we have the full pedigree of
individuals in the selection experiment, we can
use mixed-model methodology (e.g., BLUP & REML)
Power: Mixed-model accounts for ALL the
covariances in the sample, not just those
between means in different generations, but
also ALL of the covariances between related
individuals.
With sufficient connectiveness (links between relatives)
between generations, can estimate the genetic trend
without requiring any control population
44
Basic model: the animal model
yij = µ + aij + eij
Vectorize the
data as
y =
y0
y1
y2...yt
, where yi =
yi1
yi2...
yini
Var(a) = !2AA
y = 1µ + a + eThe (simple) model becomes
Here, Aii = (1+fi), Aij = 2$ij
Var(e) = !2e I
With additional fixed effects, y = Xb + a + e
45
The estimated mean in generation k is the average
of the estimated breeding values in generation k,
ak =1nk
nk∑
j=1
akj
Interesting complication: The BLUP estimate of a
requires a prior estimate of the heritability h2.
REML/BLUP (also called empirical BLUP). First
uses the data to obtain a REML estimate of Var(A),
then use this for the BLUPs
A Bayesian analysis fully accounts for the uncertainty
in using an estimate for Var(A)
46
REML/BLUP analysis vs. LS analysis
LS analysis uses regression to estimate realized
heritability.
BLUP produces smoother estimates, as individual BVs
are compared with an index based on their relatives.
BV values are regressed towards the value predicted
by the index, smoothing variations out
BLUP assumes a base population heritability and then
computes the genetic trend by plotting mean BV’s
47
The relationship matrix A fully accounts for the effects
of drift and the generation of linkage disequilibrium
(assuming the infinitesimal model holds).
This occurs because even in the face of drift and
selection the covariance matrix of breeding values
is the product of base-population !2A times the
relationship matrix (under the infinitesimal model)
This independence occurs because of the nature
of the breeding value regression
Ai = (1/2) Afi + (1/2)Ami + si
The segregation residual s (also referred to
as Mendelian sampling) is the variation generated
by heterozygous parents
48
Segregation residual, s, is the key here
Under the infinitesimal model, si is independent
of parental breeding values, with mean zero
Var(si) = (1- fi ) (!2A /2)
Where fi is the mean inbreeding for i’s parents,
and !2A is the base population additive variance
More generally, the vector s ~ MVN(0, [!2A /2] F)
Where F is a diagonal matrix with i-th element
Fii = (1− f i ) =(
1− fk + fj
2
)=
(2− Akk + Ajj
2
)
Distribution of s unaffected by BVs of parents, and
hence selection and/or assortative mating
49
Going BayesianA MCMC sampler for a Bayesian mixed model is
straightforward (see WL Appendix 3).
One important (but very subtle) issue is that PBVs
(predicted breeding values) are used in estimate
a genetic trend. The problem is that PBVs are
correlated, and hence have a GLS structure. This
is especially problematic with the smaller pedigrees
found in wild populations. Results in mean PBV-year
regressions being unbiased but highly anticonservative
(p values are highly biased towards small values)
Using the distribution of slopes in an MCMC sampler
avoids this problem
50
Example of Bayesian analysis in Large White pigs
Blasco et al (1998)
51
LS, MM, or Bayes?
52