Introduction to Linear Mixed Models for genetic problems

Raphael Mrode

Training in quantitative genetics and genomics

30 May to 10th June 2016

ILRI, Nairobi

http://www.flickr.com/photos/ilri/sets/72157632057087650/with/8198718265/


Basic Problem

• Decomposing the sources of variation that contribute to observed phenotypes

• P = Genetic (G) + Environment (E)

• Thus Var(P) = Var(G) + Var(E), assuming G and E are uncorrelated

– Sources of Var(E) could be environmental (e.g. season) and management factors (e.g. age at which animals calve).

– Sources of Var(G) arise from different forms of inheritance, leading to different components of genetic variance (e.g. additive genetic variance, additive maternal genetic variance and so on).
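
The decomposition can be checked numerically. The following is a minimal simulation sketch, not part of the original slides: the variance components (30 and 70), the number of animals and the random seed are all hypothetical, and G and E are drawn independently so that their covariance is close to zero.

```python
import numpy as np

# Hypothetical variance components, chosen only for illustration.
var_g, var_e = 30.0, 70.0
n_animals = 100_000

rng = np.random.default_rng(2016)
g = rng.normal(0.0, np.sqrt(var_g), n_animals)  # genetic effects (G)
e = rng.normal(0.0, np.sqrt(var_e), n_animals)  # environmental effects (E), drawn independently of G
p = g + e                                       # phenotypes, P = G + E

var_p = p.var(ddof=1)
sum_of_components = g.var(ddof=1) + e.var(ddof=1)
print(f"Var(P)          = {var_p:.2f}")
print(f"Var(G) + Var(E) = {sum_of_components:.2f}")
```

The two printed values differ only by twice the sample covariance between G and E, which shrinks as the number of animals grows.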


Basic problem

• Accurate estimation of Var(G) or breeding values involves identifying and accounting for non-genetic systematic sources of variation (Var(E))

• This becomes more important as we often use field data which are subject to a variety of environmental factors and are not balanced

• A simple solution to account for environmental factors is to utilise records that are deviations from appropriate means.

• This has some limitations:

– Need to account for the amount of information used to estimate those means (e.g. is a mean estimated from 5 records as precise as one estimated from 100 records?)

– Usually several factors are involved, so adjusting for each of them in this way is not feasible.
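
The question about 5 versus 100 records can be made concrete with the standard error of a mean, sd / sqrt(n). This is a small sketch under the assumption of independent records; the phenotypic standard deviation of 10 is a hypothetical value.

```python
import math


def standard_error_of_mean(sd: float, n_records: int) -> float:
    """Standard error of a mean estimated from n_records independent records."""
    return sd / math.sqrt(n_records)


phenotypic_sd = 10.0  # hypothetical value, e.g. kg of milk
se_5 = standard_error_of_mean(phenotypic_sd, 5)
se_100 = standard_error_of_mean(phenotypic_sd, 100)
print(f"SE of a mean from 5 records:   {se_5:.2f}")
print(f"SE of a mean from 100 records: {se_100:.2f}")
```

A deviation from a mean based on 5 records therefore carries much more sampling error than one based on 100 records, which simple pre-adjustment ignores.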


Linear mixed model

• A framework that models phenotypic observations while accounting for all known effects is therefore desirable.

• The linear mixed model provides such a framework.

• A linear model consists of:

– The data vector (y), which is a set of observations on the study units, e.g. cows, deer, Sitka spruce trees, or humans.

– It is usually assumed that the data follow a normal distribution: a univariate normal distribution, N(μ, V), if there is one trait, or a multivariate normal distribution, MVN(μ, V), if there are several traits.

– The model then defines the factors (explanatory variables) which may have an effect on the observations.

  • These could be discrete (herd, year) or continuous (age).

  • Some may be central to the analysis (e.g. a treatment effect), whilst others may be what are called nuisance factors, which are expected to have an effect on the trait and are included to improve precision and avoid bias.


Simple linear model and its matrix form

• Assume that cow fat yield on a single farm is influenced by how many lactations the cow has had and the amount of milk produced.

• A simple linear model (which is not necessarily biologically correct!) is:

• yij = μ + αi + β (xij - xbar) + εij

• where

• yij = fat yield of the jth cow with i lactations

• μ = overall mean fat yield

• αi = effect of the ith lactation on fat yield

• xij = milk yield, with xbar the mean milk yield in the data

• β = regression coefficient of fat yield on milk yield (slope)

• εij = residual error


Matrix form

• Model can be simplified in a matrix form:

• Suppose there are 4 cows in the data with two cows in first lactation and two cows in second lactation:

• y11 = μ + α1 + β (x11- xbar) + ε11

• y12 = μ + α1 + β (x12- xbar) + ε12

• y21 = μ + α2 + β (x21 - xbar) + ε21

• y22 = μ + α2 + β (x22 - xbar) + ε22

• We cannot simultaneously fit two terms for the effect of lactation as it only has 1 d.f., so we must introduce a linear constraint on α1 and α2, and here we shall use α1 + α2 = 0, so α2 = -α1.

• If yT = (y11, y12, y21, y22), bT = (μ, α1, β), eT = (ε11, ε12, ε21, ε22) and

\[
X = \begin{bmatrix}
1 & 1 & (x_{11} - \bar{x}) \\
1 & 1 & (x_{12} - \bar{x}) \\
1 & -1 & (x_{21} - \bar{x}) \\
1 & -1 & (x_{22} - \bar{x})
\end{bmatrix}
\]

• In matrix form the previous equations can be written:

• y = Xb + e

• X = incidence matrix for the fixed effects, as it describes the ‘incidence’ of the different fixed effects in the observed data. It is also called a design matrix, as it encapsulates the design of the study.
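
As an illustration of how such a design matrix is assembled, here is a minimal NumPy sketch. The milk and fat yields for the four cows are hypothetical values invented for this example; only the structure of X (a column of ones, a +1/-1 column for the constraint α2 = -α1, and the centred covariate) follows the slide.

```python
import numpy as np

# Hypothetical records for 4 cows: lactation number, milk yield (kg), fat yield (kg).
lactation = np.array([1, 1, 2, 2])
milk = np.array([6000.0, 6400.0, 7000.0, 7800.0])
fat = np.array([230.0, 250.0, 265.0, 300.0])

x_dev = milk - milk.mean()                       # (x_ij - xbar)
alpha_col = np.where(lactation == 1, 1.0, -1.0)  # encodes the constraint alpha_2 = -alpha_1
X = np.column_stack([np.ones(4), alpha_col, x_dev])

# Least squares solution for b = (mu, alpha_1, beta)
b_hat, *_ = np.linalg.lstsq(X, fat, rcond=None)
mu, alpha_1, beta = b_hat
print(X)
print("mu, alpha_1, beta =", b_hat)
```

With this coding, the effect of the second lactation is recovered as -alpha_1 rather than being fitted as a separate column.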


Fixed versus random

• What was described above is a fixed effects model.

– The word ‘fixed’ describes the way we interpret the factors.

– The solutions are the same as ‘least squares’ solutions.

– However, the fixed effects model is not the most useful model for genetic and genomic analyses.

• We might consider that the set of effects observed is the result of randomly sampling a much bigger distribution of effects.

• Suppose we look at a sample of animals in a large population and estimate the breeding values of their sires. Then we might consider the breeding values as a random sample drawn from a distribution with variance Vs.

• These effects are called ‘random effects’.


Fixed versus random

• No general consensus on classification of effects either as fixed or random.

• In general, factors are considered as fixed effects when

– All possible levels are represented in the study, e.g. gender, lactation numbers, or perhaps herds.

– Therefore inferences can be made with respect to all the levels. For example there are only two sexes and both are included in the study.


Fixed versus random

• In general, factors are considered as random effects when

– Levels consist of random samples drawn from the levels in an infinitely large population.

– Therefore inferences are made about the population levels rather than just the sub-set represented in the data.

– Usually sires, animals etc are considered random as repeated sampling may result in other animals being drawn from the population


Linear mixed models

• If explanatory variables in a model consist of both fixed effects and random effects, then it is called a mixed linear model.

• With random effects, the solutions depend on the variance assumed for the random effects. Thus, if random sire effects are fitted, the sire variance Vs has to be estimated.

• It is usually assumed that random effects are drawn from a normal distribution, N(0, V), where V is the variance of the effects and the mean is 0.

• For genetic effects in a mixed linear model, this assumption is equivalent to assuming that traits are determined by additive alleles of infinitesimally small effect at infinitely many unlinked loci. This is called the infinitesimal model.


Linear mixed model

• In matrix notation, a mixed linear model may be represented as

• y = Xb + Za + e

• where

• y = n x 1 vector of observations; n = number of records

• b = p x 1 vector of fixed effects; p = number of levels for fixed effects

• a = q x 1 vector of random animal effects; q = number of levels for random effects

• e = n x 1 vector of random residual effects

• X = design matrix of order n x p, that relates records to fixed effects

• Z = design matrix of order n x q, that relates records to random animal effects

• X and Z are both termed design or incidence matrices.


Assumptions of the linear mixed model

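• It is assumed that the expectations (E) of the variables are

\[
E(y) = Xb; \quad E(a) = E(e) = 0, \quad \text{i.e.} \quad
E\begin{bmatrix} y \\ a \\ e \end{bmatrix} = \begin{bmatrix} Xb \\ 0 \\ 0 \end{bmatrix}
\]

• This is also known as the first moment, while the second moments describe the variance-covariance structure of y.

• It is assumed that residual effects, which include random environmental and non-additive genetic effects, are independently distributed with variance σ²e. Therefore var(e) = Iσ²e = R; var(a) = G = Iσ²a or Aσ²a, where A is the additive genetic relationship matrix; and cov(a,e) = cov(e,a) = 0. Thus

\[
\mathrm{var}\begin{bmatrix} a \\ e \end{bmatrix} = \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}
\]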


Assumptions of the linear mixed model

• var(y) = V = var( Za + e)

• = Z var(a) Z' + var(e) + cov(Za,e) + cov(e,Za)

• = ZGZ' + R + Z cov(a,e) + cov(e,a)Z'

• Since cov(a,e) = cov(e,a) = 0, then

• V = ZGZ' + R

•

• cov(y,a) = cov(Za + e, a)

• = cov(Za,a) + cov(e,a)

• = Z cov(a,a)

• = ZG

• and

• cov(y,e) = cov(Za + e, e)

• = cov(Za,e) + cov(e,e)

• = Z cov(a,e) + cov(e,e)

• = R
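
The structure of V = ZGZ' + R and cov(y,a) = ZG is easy to see on a small case. The sketch below assumes a hypothetical layout of 5 records from 2 unrelated sires, with assumed variances of 20 for the sire effects and 80 for the residuals; none of these values come from the slides.

```python
import numpy as np

# Hypothetical layout: records 1-3 come from sire 1, records 4-5 from sire 2.
sire_of_record = np.array([0, 0, 0, 1, 1])
n, q = len(sire_of_record), 2
Z = np.zeros((n, q))
Z[np.arange(n), sire_of_record] = 1.0

var_a, var_e = 20.0, 80.0   # assumed variance components
G = np.eye(q) * var_a       # var(a) = I * sigma2_a (unrelated sires)
R = np.eye(n) * var_e       # var(e) = I * sigma2_e

V = Z @ G @ Z.T + R         # var(y)
cov_y_a = Z @ G             # cov(y, a)
print(V)
```

Records sharing a sire have covariance equal to the sire variance, records from different sires have zero covariance, and each diagonal element is the sum of the two variances.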


Estimation and prediction with variances of random effects known

• From the linear mixed model we want to predict a linear function of b and a using a linear function of y

• The linear function of y is chosen such that it is unbiased and the variance of the prediction errors is minimised. This leads to the best linear unbiased prediction (BLUP) of a:

\[
\hat{a} = \mathrm{BLUP}(a) = GZ'V^{-1}(y - X\hat{b})
\]

• with $\hat{b} = (X'V^{-1}X)^{-}X'V^{-1}y$, the generalised least squares (GLS) solution for b, where $(X'V^{-1}X)^{-}$ is a generalised inverse. Then $k'\hat{b}$ is the best linear unbiased estimator (BLUE) of k'b, given that k'b is estimable.

• BLUE therefore means

• Best - maximises the correlation between true and estimated values of fixed effects by minimising the error variance

• Linear - factors for which estimates are required are linear functions of the data

• Unbiased - estimates of fixed effects and estimable functions are such that

\[
E(b \mid \hat{b}) = \hat{b}
\]
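
A direct computation of the GLS solution and the BLUP through V⁻¹ can be sketched as follows. The six records, the covariate, the grouping into three unrelated random groups (e.g. sires) and the variances 4 and 8 are all hypothetical; X has full column rank here, so an ordinary inverse can stand in for the generalised inverse.

```python
import numpy as np

# Hypothetical data: 6 records, fixed effects = overall mean + one covariate,
# random effects = 3 unrelated groups (e.g. sires) with 2 records each.
y = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
X = np.column_stack([np.ones(6), np.arange(1.0, 7.0)])
group = np.array([0, 0, 1, 1, 2, 2])
Z = np.zeros((6, 3))
Z[np.arange(6), group] = 1.0

var_a, var_e = 4.0, 8.0          # assumed variance components
G = var_a * np.eye(3)
R = var_e * np.eye(6)
V = Z @ G @ Z.T + R
V_inv = np.linalg.inv(V)

# GLS: b_hat = (X'V^-1 X)^-1 X'V^-1 y
b_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
# BLUP: a_hat = G Z' V^-1 (y - X b_hat)
a_hat = G @ Z.T @ V_inv @ (y - X @ b_hat)
print("b_hat =", b_hat)
print("a_hat =", a_hat)
```

Because the model contains an overall mean and every record belongs to exactly one group, the predicted random effects sum to zero in this example.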


Mixed Model Equations

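• The previous equation for the BLUP of a requires V⁻¹, which is not always computationally feasible with large data sets.

• Henderson presented the mixed model equations (MME) to estimate b and predict a simultaneously without the need to compute V⁻¹. The MME are

\[
\begin{bmatrix}
X'R^{-1}X & X'R^{-1}Z \\
Z'R^{-1}X & Z'R^{-1}Z + G^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}
\]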


MME

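• Since R = Iσ²e, R⁻¹ is a scalar multiple of the identity matrix and can be factored out from both sides of the equation. With G = Iσ²a this gives:

\[
\begin{bmatrix}
X'X & X'Z \\
Z'X & Z'Z + I\alpha
\end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
\]

• with α = σ²e/σ²a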
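
The claim that the MME give the same solutions as the V⁻¹-based formulas can be checked numerically. This is a minimal sketch on hypothetical data (six records, a mean plus one covariate as fixed effects, three unrelated random groups, assumed variances 4 and 8), with G = Iσ²a and R = Iσ²e as in the simplified equations.

```python
import numpy as np

# Hypothetical data, invented for this comparison only.
y = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
X = np.column_stack([np.ones(6), np.arange(1.0, 7.0)])
Z = np.zeros((6, 3))
Z[np.arange(6), np.array([0, 0, 1, 1, 2, 2])] = 1.0
var_a, var_e = 4.0, 8.0
alpha = var_e / var_a

# Route 1: GLS and BLUP through the inverse of V = ZGZ' + R
V_inv = np.linalg.inv(var_a * Z @ Z.T + var_e * np.eye(6))
b_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
a_blup = var_a * Z.T @ V_inv @ (y - X @ b_gls)

# Route 2: mixed model equations, no inverse of V required
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + alpha * np.eye(3)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
solution = np.linalg.solve(lhs, rhs)
b_mme, a_mme = solution[:2], solution[2:]

print("max |b difference| =", np.abs(b_gls - b_mme).max())
print("max |a difference| =", np.abs(a_blup - a_mme).max())
```

Route 1 inverts an n x n matrix, whereas route 2 solves a system of order p + q, which is why the MME scale better when there are many records.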


Example Data

--------------------------------------------------------------
Cow   Herd   Calving class   Sire   Test day milk yield (kg)
--------------------------------------------------------------
  1     1          2           1            36.2
  2     1          1           2            25.8
  3     1          2           3            31.5
  4     1          1           1            42.0
  5     1          2           2            12.3
  6     2          2           3            28.5
  7     2          1           1            10.6
  8     2          2           2            23.4
  9     2          2           3            22.4
 10     2          1           1            14.8
--------------------------------------------------------------


Example-fixed effect model

• Initially, consider a fixed effect model with herd and calving class as the only fixed effects:

• y = Xb + e

• with b = vector of solutions for herd and calving class.

• Then $\hat{b} = (X'X)^{-}X'y$, where $(X'X)^{-}$ is a generalised inverse of X'X.

• $\hat{b}$ is the ordinary least squares solution, which is a special case of GLS.

• This assumes that all observations are uncorrelated and have a common variance σ²e.


Example : fixed effect model

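• The transpose of the design matrix X (rows: herd 1, herd 2, calving class 1, calving class 2; columns: cows 1 to 10), X'X and X'y are

\[
X' = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0
\end{bmatrix}
\]

\[
X'X = \begin{bmatrix}
5 & 0 & 2 & 3 \\
0 & 5 & 2 & 3 \\
2 & 2 & 4 & 0 \\
3 & 3 & 0 & 6
\end{bmatrix},
\quad
X'y = \begin{bmatrix} 147.8 \\ 99.7 \\ 93.2 \\ 154.3 \end{bmatrix}
\]

• The least squares equations are therefore

\[
\begin{bmatrix}
5 & 0 & 2 & 3 \\
0 & 5 & 2 & 3 \\
2 & 2 & 4 & 0 \\
3 & 3 & 0 & 6
\end{bmatrix}
\begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \\ \hat{b}_3 \\ \hat{b}_4 \end{bmatrix}
=
\begin{bmatrix} 147.8 \\ 99.7 \\ 93.2 \\ 154.3 \end{bmatrix}
\]

• X'X is not of full rank, so the equation for the 2nd herd is set to zero, which gives the solutions

\[
\hat{b} = \begin{bmatrix} 9.620 \\ 0.000 \\ 18.490 \\ 20.907 \end{bmatrix}
\]

• If instead the equation for the first herd is set to zero, we get

\[
\hat{b} = \begin{bmatrix} 0.000 \\ -9.620 \\ 28.110 \\ 30.527 \end{bmatrix}
\]

• Note: linear contrasts (e.g. herd 1 minus herd 2, or calving class 2 minus calving class 1) are the same for both solution vectors.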
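
The fixed effect analysis of the example data can be reproduced with a few lines of NumPy. This is a sketch rather than the software used for the course; the helper `solve_with_zero` is a hypothetical name introduced here to impose the constraint by dropping one equation.

```python
import numpy as np

herd = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
calving = np.array([2, 1, 2, 1, 2, 2, 1, 2, 2, 1])
y = np.array([36.2, 25.8, 31.5, 42.0, 12.3, 28.5, 10.6, 23.4, 22.4, 14.8])

# Columns of X: herd 1, herd 2, calving class 1, calving class 2
X = np.column_stack([herd == 1, herd == 2, calving == 1, calving == 2]).astype(float)
XtX, Xty = X.T @ X, X.T @ y


def solve_with_zero(lhs, rhs, zero_idx):
    """Solve the normal equations with the solution at zero_idx constrained to zero."""
    keep = [i for i in range(len(rhs)) if i != zero_idx]
    b = np.zeros(len(rhs))
    b[keep] = np.linalg.solve(lhs[np.ix_(keep, keep)], rhs[keep])
    return b


b_herd2_zero = solve_with_zero(XtX, Xty, zero_idx=1)
b_herd1_zero = solve_with_zero(XtX, Xty, zero_idx=0)

print("herd 2 set to zero:", np.round(b_herd2_zero, 3))
print("herd 1 set to zero:", np.round(b_herd1_zero, 3))
print("class 2 - class 1 :", b_herd2_zero[3] - b_herd2_zero[2], b_herd1_zero[3] - b_herd1_zero[2])
```

The individual solutions depend on which equation is dropped, but the herd and calving class contrasts and the fitted values X b̂ do not.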


Hypothesis testing

• Usually, one of the main aims with respect to fixed effects is to test

– whether particular factors have a significant effect on the observations;

– and hence whether those factors should be left in the model or dropped.

• The usual sums of squares needed are:

\[
SS_{total} = SST = y'Q_T y, \quad Q_T = V^{-1}
\]

\[
SS_{model} = SSR = y'Q_M y, \quad Q_M = V^{-1}X(X'V^{-1}X)^{-}X'V^{-1}
\]

\[
SS_{error} = SSE = y'Q_E y, \quad Q_E = V^{-1} - V^{-1}X(X'V^{-1}X)^{-}X'V^{-1}
\]

• The ratio of two independent central chi-square variables, each divided by its degrees of freedom, has an F-distribution.

• Therefore the adequacy of the whole model can be tested as:

\[
F_M = \frac{SSR / r(X)}{SSE / (N - r(X))}
\]

• where N is the number of records and r(X) is the rank of X.
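
For the example data these quantities can be computed as below. The sketch assumes V = I, so that SST = y'y and SSR = b̂'X'y; a common residual variance would rescale both sums of squares equally and cancel in the F ratio. The Moore-Penrose pseudo-inverse is used here as one convenient generalised inverse.

```python
import numpy as np

herd = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
calving = np.array([2, 1, 2, 1, 2, 2, 1, 2, 2, 1])
y = np.array([36.2, 25.8, 31.5, 42.0, 12.3, 28.5, 10.6, 23.4, 22.4, 14.8])
X = np.column_stack([herd == 1, herd == 2, calving == 1, calving == 2]).astype(float)

N = len(y)
rank_X = np.linalg.matrix_rank(X)            # r(X); X is not of full column rank
b_hat = np.linalg.pinv(X.T @ X) @ X.T @ y    # one solution of the normal equations

SST = y @ y                                  # y'y
SSR = b_hat @ X.T @ y                        # b_hat'X'y, identical for any solution b_hat
SSE = SST - SSR
F_model = (SSR / rank_X) / (SSE / (N - rank_X))

print(f"r(X) = {rank_X}, SST = {SST:.2f}, SSR = {SSR:.2f}, SSE = {SSE:.2f}")
print(f"F for the whole model = {F_model:.2f} on {rank_X} and {N - rank_X} d.f.")
```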


Hypothesis testing

• Usually of interest is testing subsets of the solution vector b, for instance if the difference between the calving classes is significant or not.

• Initially construct a function such as b4 - b3:

• K'b = (0 0 -1 1) b = b4 - b3

• For the example data, $K'\hat{b}$ = 20.907 - 18.490 = 2.417

• The test statistic is

\[
F = \frac{ss / r(K')}{SSE / (N - r(X))}
\]

• where the sum of squares for K'b is

\[
ss = (K'\hat{b})'\,(K'(X'X)^{-}K)^{-1}\,K'\hat{b}
\]


Hypothesis testing

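• For the example data, the generalised inverse of X'X (with the equation for herd 2 set to zero) is

\[
(X'X)^{-} = \begin{bmatrix}
0.400 & 0.0 & -0.200 & -0.200 \\
0.0 & 0.0 & 0.0 & 0.0 \\
-0.200 & 0.0 & 0.350 & 0.100 \\
-0.200 & 0.0 & 0.100 & 0.267
\end{bmatrix}
\]

• Therefore ss = 14.019, with r(K') = 1.

• SSE = $y'y - \hat{b}'X'y$ = 7076.59 - 6371.00 = 705.59

• F = 14.019 / (705.59/7) = 0.139, which is not significant at the 5% level.

• In ASREML, you can use ‘contrast’ to test hypotheses.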
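
The contrast between calving classes can be tested with the same ingredients. This is a sketch assuming V = I; it uses the pseudo-inverse as the generalised inverse, which is safe because b4 - b3 is an estimable function and so neither K'b̂ nor K'(X'X)⁻K depends on the choice of generalised inverse.

```python
import numpy as np

herd = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
calving = np.array([2, 1, 2, 1, 2, 2, 1, 2, 2, 1])
y = np.array([36.2, 25.8, 31.5, 42.0, 12.3, 28.5, 10.6, 23.4, 22.4, 14.8])
X = np.column_stack([herd == 1, herd == 2, calving == 1, calving == 2]).astype(float)

N = len(y)
rank_X = np.linalg.matrix_rank(X)
XtX_ginv = np.linalg.pinv(X.T @ X)           # a generalised inverse of X'X
b_hat = XtX_ginv @ X.T @ y

K = np.array([0.0, 0.0, -1.0, 1.0])          # K'b = b4 - b3 (class 2 minus class 1)
Kb = K @ b_hat
ss = Kb * Kb / (K @ XtX_ginv @ K)            # (K'b)'(K'(X'X)^- K)^-1 K'b with r(K') = 1
SSE = y @ y - b_hat @ X.T @ y
F_contrast = (ss / 1) / (SSE / (N - rank_X))

print(f"K'b = {Kb:.3f}, ss = {ss:.3f}, SSE = {SSE:.2f}, F = {F_contrast:.3f}")
```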


Example: Mixed linear model

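• Data are analysed assuming that the sires are random and unrelated. It is also assumed that σ²e = 100 and σ²s = 200, therefore α = σ²e/σ²s = 100/200 = 0.50.

• The transpose of the matrix Z that relates records to the sires (rows: sires 1 to 3; columns: cows 1 to 10) and Z'Z are

\[
Z' = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0
\end{bmatrix}
\quad \text{and} \quad
Z'Z = \begin{bmatrix}
4 & 0 & 0 \\
0 & 3 & 0 \\
0 & 0 & 3
\end{bmatrix}
\]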


Example: Mixed linear model

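• The MME are

\[
\begin{bmatrix}
5 & 0 & 2 & 3 & 2 & 2 & 1 \\
0 & 5 & 2 & 3 & 2 & 1 & 2 \\
2 & 2 & 4 & 0 & 3 & 1 & 0 \\
3 & 3 & 0 & 6 & 1 & 2 & 3 \\
2 & 2 & 3 & 1 & 4.5 & 0 & 0 \\
2 & 1 & 1 & 2 & 0 & 3.5 & 0 \\
1 & 2 & 0 & 3 & 0 & 0 & 3.5
\end{bmatrix}
\begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \\ \hat{b}_3 \\ \hat{b}_4 \\ \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \end{bmatrix}
=
\begin{bmatrix} 147.8 \\ 99.7 \\ 93.2 \\ 154.3 \\ 103.6 \\ 61.5 \\ 82.4 \end{bmatrix}
\]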


Example : Mixed linear model

• Solving the MME by direct inversion gives the following solutions

•

• ---------------------------------------

• Fixed effects

• Herds

• 1 11.330

• 2 0.000

• Calving class

• 1 17.510

• 2 19.817

• Random sire effects

• 1 1.909

• 2 -5.230

• 3 3.320

• ----------------------------------------
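
The sire model solutions can be reproduced by building the MME from the example data and solving them. This is a minimal sketch, assuming (as in the fixed effect example) that the equation for herd 2 is set to zero to handle the dependency among the fixed effect columns.

```python
import numpy as np

herd = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
calving = np.array([2, 1, 2, 1, 2, 2, 1, 2, 2, 1])
sire = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1])
y = np.array([36.2, 25.8, 31.5, 42.0, 12.3, 28.5, 10.6, 23.4, 22.4, 14.8])

X = np.column_stack([herd == 1, herd == 2, calving == 1, calving == 2]).astype(float)
Z = (sire[:, None] == np.arange(1, 4)).astype(float)   # 10 x 3, one column per sire

alpha = 100.0 / 200.0                                   # sigma2_e / sigma2_s
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + alpha * np.eye(3)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])

# Constrain herd 2 to zero by dropping its row and column (index 1).
keep = [0, 2, 3, 4, 5, 6]
solutions = np.zeros(7)
solutions[keep] = np.linalg.solve(lhs[np.ix_(keep, keep)], rhs[keep])

labels = ["herd 1", "herd 2", "class 1", "class 2", "sire 1", "sire 2", "sire 3"]
for label, value in zip(labels, solutions):
    print(f"{label:8s} {value:8.3f}")
```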



• Accounting for the random effects of sires in the model has affected the solutions for the fixed effects.

• For the purposes of illustration, if α = σ²e/σ²s = 100/5 = 20, indicating only a small variance for the sires, the solutions are:

• Herds

• 1 9.866

• 2 0.000

• Calving class

• 1 18.308

• 2 20.766

• Random sire effects

• 1 0.341

• 2 -0.786

• 3 0.445



• When sire variance is small, the solutions for the fixed effects are more similar to those from the fixed effect model.

• Thus, when a model includes random effects with large variance, the MME provide the most appropriate methodology.

• Usually an appropriate value of α, estimated from the data, is used in the analysis.

• Using the appropriate α ensures that the estimates of random effects are shrunk towards the mean to the appropriate degree.
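
The effect of α on shrinkage can be explored by solving the MME of the example for several values. In this sketch the function name `sire_model_solutions` and the third value of α (1000) are illustrative choices, not part of the slides; herd 2 is again constrained to zero.

```python
import numpy as np

herd = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
calving = np.array([2, 1, 2, 1, 2, 2, 1, 2, 2, 1])
sire = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1])
y = np.array([36.2, 25.8, 31.5, 42.0, 12.3, 28.5, 10.6, 23.4, 22.4, 14.8])
X = np.column_stack([herd == 1, herd == 2, calving == 1, calving == 2]).astype(float)
Z = (sire[:, None] == np.arange(1, 4)).astype(float)


def sire_model_solutions(alpha):
    """Solve the MME for a given alpha = sigma2_e / sigma2_s, with herd 2 set to zero."""
    q = Z.shape[1]
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + alpha * np.eye(q)]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    keep = [i for i in range(len(rhs)) if i != 1]   # drop the herd 2 equation
    sol = np.zeros(len(rhs))
    sol[keep] = np.linalg.solve(lhs[np.ix_(keep, keep)], rhs[keep])
    return sol[:4], sol[4:]                          # fixed effects, sire effects


for alpha in (0.5, 20.0, 1000.0):
    fixed, sires = sire_model_solutions(alpha)
    print(f"alpha = {alpha:7.1f}  fixed = {np.round(fixed, 3)}  sires = {np.round(sires, 3)}")
```

As α grows (sire variance small relative to residual variance), the sire solutions are pulled towards zero and the fixed effect solutions approach those of the fixed effect model.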

The presentation has a Creative Commons licence. You are free to re-use or distribute this work, provided credit is given to ILRI.

better lives through livestock

ilri.org
