Lecture 12: Analysis of Selection Experimentsnitro.biosci.arizona.edu/.../Lectures/Lecture12.pdf · Lecture 12: Analysis of Selection Experiments Bruce Walsh lecture notes Synbreed

1

Lecture 12:

Analysis of Selection

Experiments

Bruce Walsh lecture notes

Synbreed course

version 4 July 2013

2

Variance in the Response to Selection

R = h2S is just the expected value of

the response, but there is a variance

about this value.

Hence, identically-selected replicate lines are

Still expected to show variation is response

The major source of such variation is genetic

drift

3

Consider the mean in generation t

zt = µ + gt + dt + et

µ = Mean of the original population

gt = The mean breeding value in generation t

dt = the effect of any major environmental trend in generation t

et = error in estimating the environmental-corrected mean breeding value

from the mean phenotype of a sample

Under this model, the mean of a replicate series of

lines is

E(zt ) = µ + E(gt ) + dt R = h2S

4

Consider the mean in generation t

E(zt ) = µ + E(gt ) + dt

The variance is given by

σ2z(t) = σ2

g(t) + σ2e(t) + σ2

d

Variance in the environmental trend (mean is set to be

zero)

Variance in the breeding value at generation t

In generation t, Mt individuals are measured. An upper bound

on the error variance is Var(z)/Mt

σ2e(t) =

σ2z

Mt

5

Variance in Breeding ValuesTwo sources of variation

(i) Sampling variance in the founding lines

(ii) Genetic drift (inbreeding) within each line

M0 = Size of the founding population

ft = Inbreeding in

generation t

σ2g(t) =

(1

M0+ 2ft

)σ2

A =(

1M0

+ 2ft

)h2σ2

z

t)([2ft = 2 1− 1− 1

2Ne

]" t/Ne for t/Ne << 1

The mean breeding values in different generations of the same

replicate line are correlated,

( )0

σ(gt , gt!) = σ ( zt, zt! ) =1

M+ 2ft h2σ2

z for t < t!

6

Variance-covariance

structure within a lineAssume the initial sample is sufficiently large so

that we can ignore 1/M0

Variance: !2(gt) = (t/Ne)h2!2z

Covariance: !2(gt, gx) = (x/Ne)h2!2z for x < t

These expressions (which are often called the

pure-drift approximations) will prove useful in

the statistical analysis of selection response

7

The Realized Heritability

Since R = h2 S, this suggests h2 = R/S, so that

the ratio of the observed response over the

observed differential provides an estimate of

the heritability, the realized heritability

Obvious definition for a single generation of response.

What about for multiple generations of response?

Cumulative selection response = sum of all responses

RC(t =t

i=1

R(i))∑

8

Cumulative selection differential = sum of the S’s

SC(t) =t

=1

S(i)∑

i

(1) The Ratio Estimator for realized heritability

= total response/total differential,

(T )Rh2

r = C

SC(T )(2) The Regression Estimator --- the slope of the

regression of cumulative response on cumulative differential

RC(t) = h2r SC(t) + et

∑b =∑

t SC(t)RC(t)

t SC(t)2Regression passes through the

origin (R = 0 when S = 0). Slope =

9

Example: Divergent selection on abdominal bristle

number in Drosophila (data from T. MacKay)

t z z S(t) R(t) SC (t) RC (t)

1 18.02 20.10 20.10−18.02 = 2.08 18.34−18.02 = 0.32 .08 0.322 18.34 21.00 21.00−18.34 = 2.66 19.05−18.34 = 0.71 .74 1.033 19.05 21.75 21.75−19.05 = 2.70 20.07−19.05 = 1.02 .44 2.054 20.07 22.55 22.55−20.07 = 2.48 20.36−20.07 = 0.29 .92 2.345 20.36 22.95 22.95−20.36 = 2.59 20.65−20.36 = 0.29 12.51 2.636 20.65

*

Ratio estimate: h2 = 2.63/12.51 = 0.2102

Regression (OLS) estimator

h 2r = bC(OLS) =

∑5i=1 SC(i) · RC(i)∑5

i=1 S 2C(i)

=78.96350.45

= 0.2245

10

60\0

Cumulative Differential

0

5

10

15

20C

um

ula

tiv

e R

esp

on

se

Note x axis is differential,

NOT generations

Ratio estimator

= 2.63/12.51 = 0.210

Slope = 0.2245

= Regression

estimator

11

Standard error of the Ratio Estimator

Ratio Estimator, h2r = RT/ST

Recall that the variance for the mean in generation

t is !2(gt) + !2(e) + !2(d)

Assume M0 >> 1 and that we can ignore the

environmental trend variance, then

!2(RT) = !2(gt) + !2(e) = (T/Ne)h2!2z + !2

z /MT

!2(RT/ST ) = !2(RT ) /(ST )2

This follows since !2(ax) = a2!2 (x)

Hence, !2(h2r) = [ (T/Ne)h2!2

z + !2z /MT ] /(ST )2

12

SE for (OLS) Regression

Estimator


The basic linear model is

Under the OLS framework (residuals homoscedastic

and uncorrelated), the linear model has the design

matrix X just the vector Sc of cumulative differential

and y = R

X = Sc

" = bc y = R

"bC(OLS) =

(XT X

) 1XT y =

(ST S

)"1ST R

∑=

Ti=1 SC(i) · RC(i)

Ti=1 S 2

C(i)∑

13

SE for (OLS) Regression

Estimator

( ( ))- -

Var[bC(OLS)

]= σ2

e XTX1

= σ2e ST S

1

= σ2e

/ T∑

i=1

S 2C(i)

σ 2e =

1T − 1

T∑

i=1

e 2i =

1T − 1

T∑

i=1

(RC(i)− h 2

r SC(i))2

14

Problems with OLS regression approach

Although the OLS regression estimator for realized

heritability is very widely used, it has fatal problems

OLS assumes the residuals are homoscedastic and

uncorrelated. In reality, the covariance structure is

!2(ei) = (i/Ne)h2!2z + !2

z /Mi

!2(ek, ei) = (i/Ne)h2!2z for i < k

Hence, the GLS regression is more appropriate

The OLS gives unbiased estimates of the realized

heritability, but it seriously underestimates its SE

15

GLS regression Estimate


X = Sc

" = bc y = R

The variance-covariance matrix V has elements

Vii = (i/Ne) h2 !2z + !2

z /Mi

Vji = Vij = (i/Ne) h2 !2z for i < j

We can directly estimate the phenotypic variance from the data

h2 is what we are trying to estimate. Use an iterative approach.

Try some initial value, use GLS to update this value, use the new

value for next round of updating. Continue until values stabilize.

- --bC(GLS) = (ST V 1S) 1ST V 1R

16

Example: GLS estimator for MacKay’s data

Here M = 100, N = 20, while Var(z) = 3.293 Building up the covariance matrix V

V = 0.1647·

h2 + 0.2 h2 h2 h2 h2

h2 2h2 + 0.2 2h2 2h2 2h2

h2 2h2 3h2 + 0.2 3h2 3h2

h2 2h2 3h2 4h2 + 0.2 4h2

h2 2h2 3h2 4h2 5h2 + 0.2

σ2 [RC(i) ] =(

i

N

)h2σ2

z +σ2

z

M= i · h2 · 0.1647 + 0.03292* *

σ [RC(i), RC(j) ] =(

i

N

)h2σ2

z = i · h2 · 0.1647 for i < j* *

Start with some initial value for h2, update V, obtain new

estimate. Repeat until convergence. Start with h2 = 0.21

h2r = bC(GLS)(1) = (STV"1S)"1STV"1 = 0.222197

Using this update, next estimate is 0.222135, which remains unchanged

17

Standard errors of the three estimates

For the OLS regression,

Var[bC(O S)

]L = σ2

e

/ T∑

i=1

S 2C(i)

σ 2e =

1T − 1

T∑

i=1

e 2i =

1T − 1

T∑

i=1

(RC(i)− h 2

r SC(i))2

= 0.091/4 = 0.0228

= 0.0228/350.45 = 0.0000649

Hence, SE for OLS estimator is 0.0081

!2(h2r) = [ (T/Ne)h2!2

z + !2z /MT ] /(ST )2

= [(5/20)*0.21*3.292 + 0.03292]/12.512 = 0.00132

For the ratio estimator,

Giving a SE of 0.0363

18

Standard errors of the three estimates

Finally, for the GLS regression,

(STV-1S)-1 = 1/790.4 = 0.001265

Hence, SE for GLS estimator is 0.0356

OLS: h2 = 0.2245 + 0.0081

Ratio: h2 = 0.2102 + 0.0363

GLS: h2 = 0.2221 + 0.0356

Much to small!

Summarizing:

19

Just how well does the

breeder’s equation work?

Sheridan (1988) compared realized heritability

estimates with estimates of heritability obtained

from resemblances between relatives in the

base populations

Punch-line: Good, but not great, fit in many settings

Problems with a wider meta-analysis is that standard

errors are often not presented nor is the data presented

in a form that allows their calculation.

20

157 (47%)8 (53%)Swine/Sheep

116 (55%)5 (45%)Poultry/Quail

3428 (82%)6 (18%)Mice/Rats

2619 (73%)7 (27%)Tribolium

6147 (77%)14 (23%)Drosophila

TotalNS differenceSignificant

Differences

Species

Comparison of realized and (relative-based)

Heritability estimates

21

Lots of ways for Breeder’s

equation to fail

• One: Inheritance model is wrong, andinfinitesimal model (large number ofloci, each of small effect) is incorrect

– However, even with just a single majorgene, BE is pretty good

• Two: Additional important featuresnot accounted by BE

22

23

24

Asymmetric Selection Response

Divergent Selection Experiment: Select some replicate

lines for increased trait value, others for decreased

value

Expectation: roughly

equal response in up and

down directions, R = h2S

Often an asymmetric response

is observed, with a significant

difference in the slope of up

vs. down-selection lines

Rc

Sc

25

Potential Causes: I. Design

Defects

• Different selection differentials (Plot is Rc vs. t,

not the correct plot of Rc vs. Sc)

• Drift (sample size not sufficiently large)

• Scale effects

• Undetected environmental trends

• Transient effects from previous selection

– Decay of epistatic response

• Undetected selection on correlated traits

26

Scale effects

When the trait biologically cannot go below

a specific value (i.e., 0), as we down-select towards

zero, expect less response.

Transform to a log

scale

27

Potential Causes: II. Nonlinear

Parent-Offspring regression

+S

+ h2S

Parent

Off

spri

ng

-S

- h2S

Linearity gives a symmetric response with

+S, -S

28

However, if PO regression is

nonlinear

+S

Parent

Off

spri

ng

-S

With this nonlinear regression, larger

absolute response for +S than for -S

29

Potential Causes: III.

Inbreeding depression

Change in mean due to

inbreeding depression.

True genetic response

in the absence of inbreeding

Depresses upward response,

Enhances downward response

30

Potential Causes: IV.

Genetic Asymmetry

• Requires changes in allele frequencies.

• The same absolute change in an allele frequency

can result in rather different changes in the

variance in the + vs. - change direction.

• This results in departures in the additive genetic

variance in up vs. down-selected lines, and hence

changes in h2 and response.

31

Additive variance, VA, with no dominance (k = 0)

Allele frequency, p

VA

When p = 1/2,

VarA(p+#) = VarA(p-#)

When p not 1/2,

VarA(p+#) not VarA(p+#)

p+#

p-#

p+#p-#

32

Additive variance, VA, with complete dominance (k = 1)

33

Additive variance, VA, with overdominance (k = 10)

34

Control Populations

zs,t = µ + gs,t + dt + es,t

zc,t = µ + gc,t + dt + ec,t

Until now, we have been ignoring the bias caused by

not accounting for any environmental trend.

One way to deal with this is to include an unselected

control population in the design

Hence,

E( zs,t − zc,t ) = E(gs,t )−E(gc,0 ) = h2SC(t)

35

E( zs,t − zc,t ) = h2SC(t) + (ds,t − dc,t)- -

−− −Rt = ( zs,t zc,t) ( zs,t"1 zc,t"1)

RC(t) = zs,t zc,t

St = z!s,t"1 − zs,t"1

-

Estimating trends with a control population

Complication 1: If G x E is present, then

Complication 2: Selection results in faster

inbreeding, so control must to comparatively inbred

to fully account for inbreeding depression

The use of a control also accounts for inbreeding depression

36

Divergent Selection Designs

zu,t = µ + gu,t + dt + eu,t

zd,t = µ + gd,t + dt + ed,t

Rt = ( zu,t− zu,t"1)− (zd,t− zd,t"1)RC(t) = zu,t − zd,t

St = ( z!u,t"1 − zu,t"1)− ( z!d,t"1− zd,t"1)

An alternative experimental design to remove

a common environmental trend is the divergent

selection design

Note that this design also accounts for inbreeding

depression (assuming up/down lines equally inbred)

Response estimated by

37

Variance in ResponseWe have been assuming that we can ignore !2

d.

With a control line and/or divergent selection, don’t have

to worry about this.

Control:

RC(t) = zs,t − zc,t = ( µ + gs,t + dt + es,t)− (µ + gc,t + dt + ec,t )= gs,t − gc,t + es,t − ec,t

The common dt term cancels

RC(t) = zsu,t− zsd,t = gsu,t − gsd,t + esu,t − esd,t

Divergent design

Again, common dt term cancels

38

1/Mu,t + 1/Md,t1/Nu + + 1/Ndfu,t + fd,tDivergent Selection,

no control

1/Ms,t + 1/Mc,t1/Ns + + 1/Ncfs,t + fc,tSelection in a single

direction, with control

1/Ms,t1/Nsfs,tSelection in a single

direction, no control

B (t > t’)AftDesign

σ2 [ RC(t) ] = (2ft +B0) 2ft h2σ2z +Bt σ2

z

" (t A + B0) h2σ2z + Bt σ2

z

σ [ RC(t), RC(t!) ] = (2ft + B0) h2σ2z

" (tA+ B0)σ2z h2 for t < t!

The resulting variance and covariances in response become

39

Variance with a Control

σ2(RC(t) ) =(

t

N+

1M0

)h2σ2

z +1M

σ2z −σ2

d

tσ2zh2

N> σ2

d

When does the use of a control population result

in a reduced variance?

Control populations are not without a cost.

Variance w/ control - variance without control =

Hence (ignoring M terms),

Regardless of the value of !2d, if sufficient generations are used, the optimal design (in

terms of giving the smallest expected variance in response) is not to use a control.

However, this approach runs the risk of an undetected directional environmental

trend compromising the estimated heritability.

40

Optimal Experimental Design

CV[ RC(t) ] =σ[RC(t) ]E[RC(t)]

(1/2Nt)1/2 /hi 2th2i!zDivergent Selection,

no control

(1/Nt)1/2/hi th2i!zSelection in one

direction, no control

(2/Nt)1/2/hi th2i!zSelection in one

direction, with control

CV [ R(t) ]E [ R(t) ]Design

The coefficient of variance (CV) provides one measure

for comparing different designs

CV scales with

Nt = total #

over the entire

experiment

41

ExampleSuppose we plan to select the upper 5% of the population

on a trait with h2 = 0.25

How large must N be to give a CV of 0.01 when no

control is used?

p = 0.05 --> i = 2.06. Assuming drift variance dominates

!2d, then

CV = 0.01 = (1/Nt)1/2/hi = (1/Nt)1/2/(0.5*2.06)

or Nt = 1/(0.01*0.5*2.06)2 = 9426

Hence, we must have at least 9,426 selected parents

over the course of the experiment

42

Moving from LS to MM and Bayes• Thus far, we have assumed a least-squares

(LS) analysis, which only uses informationfrom the generation means.

• When the pedigree is known, mixed-model(MM, e.g., the animal model) is much morepowerful, using all of the data and easilyhandling unbalanced design. MM is alikelihood analysis.

• Even better, a Bayes MM analysis fullyaccounts for model uncertainty (e.g., usingan estimated variance for BLUP).

43

Mixed-Model EstimationPROVIDED that we have the full pedigree of

individuals in the selection experiment, we can

use mixed-model methodology (e.g., BLUP & REML)

Power: Mixed-model accounts for ALL the

covariances in the sample, not just those

between means in different generations, but

also ALL of the covariances between related

individuals.

With sufficient connectiveness (links between relatives)

between generations, can estimate the genetic trend

without requiring any control population

44

Basic model: the animal model

yij = µ + aij + eij

Vectorize the

data as

y =

y0

y1

y2...yt

, where yi =

yi1

yi2...

yini

Var(a) = !2AA

y = 1µ + a + eThe (simple) model becomes

Here, Aii = (1+fi), Aij = 2$ij

Var(e) = !2e I

With additional fixed effects, y = Xb + a + e

45

The estimated mean in generation k is the average

of the estimated breeding values in generation k,

ak =1nk

nk∑

j=1

akj

Interesting complication: The BLUP estimate of a

requires a prior estimate of the heritability h2.

REML/BLUP (also called empirical BLUP). First

uses the data to obtain a REML estimate of Var(A),

then use this for the BLUPs

A Bayesian analysis fully accounts for the uncertainty

in using an estimate for Var(A)

46

REML/BLUP analysis vs. LS analysis

LS analysis uses regression to estimate realized

heritability.

BLUP produces smoother estimates, as individual BVs

are compared with an index based on their relatives.

BV values are regressed towards the value predicted

by the index, smoothing variations out

BLUP assumes a base population heritability and then

computes the genetic trend by plotting mean BV’s

47

The relationship matrix A fully accounts for the effects

of drift and the generation of linkage disequilibrium

(assuming the infinitesimal model holds).

This occurs because even in the face of drift and

selection the covariance matrix of breeding values

is the product of base-population !2A times the

relationship matrix (under the infinitesimal model)

This independence occurs because of the nature

of the breeding value regression

Ai = (1/2) Afi + (1/2)Ami + si

The segregation residual s (also referred to

as Mendelian sampling) is the variation generated

by heterozygous parents

48

Segregation residual, s, is the key here

Under the infinitesimal model, si is independent

of parental breeding values, with mean zero

Var(si) = (1- fi ) (!2A /2)

Where fi is the mean inbreeding for i’s parents,

and !2A is the base population additive variance

More generally, the vector s ~ MVN(0, [!2A /2] F)

Where F is a diagonal matrix with i-th element

Fii = (1− f i ) =(

1− fk + fj

2

)=

(2− Akk + Ajj

2

)

Distribution of s unaffected by BVs of parents, and

hence selection and/or assortative mating

49

Going BayesianA MCMC sampler for a Bayesian mixed model is

straightforward (see WL Appendix 3).

One important (but very subtle) issue is that PBVs

(predicted breeding values) are used in estimate

a genetic trend. The problem is that PBVs are

correlated, and hence have a GLS structure. This

is especially problematic with the smaller pedigrees

found in wild populations. Results in mean PBV-year

regressions being unbiased but highly anticonservative

(p values are highly biased towards small values)

Using the distribution of slopes in an MCMC sampler

avoids this problem

50

Example of Bayesian analysis in Large White pigs

Blasco et al (1998)

51

LS, MM, or Bayes?

52

Documents

Lecture 12: Analysis of Selection Experimentsnitro.biosci.arizona.edu/.../Lectures/Lecture12.pdf · Lecture 12: Analysis of Selection Experiments Bruce Walsh lecture notes Synbreed