GENETICS | GENOMIC SELECTION
Modeling Epistasis in Genomic Selection
Yong Jiang and Jochen C. Reif¹
Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Stadt Seeland, Germany
ABSTRACT Modeling epistasis in genomic selection is impeded by a high computational load. The extended genomic best linear unbiased prediction (EG-BLUP) with an epistatic relationship matrix and the reproducing kernel Hilbert space regression (RKHS) are two attractive approaches that reduce the computational load. In this study, we proved the equivalence of EG-BLUP and genomic selection approaches explicitly modeling epistatic effects. Moreover, we have shown why the RKHS model based on a Gaussian kernel captures epistatic effects among markers. Using experimental data sets in wheat and maize, we compared different genomic selection approaches and concluded that prediction accuracy can be improved by modeling epistasis for selfing species but may not for outcrossing species.
KEYWORDS epistasis; genomic selection; genomic best linear unbiased prediction (G-BLUP); extended G-BLUP (EG-BLUP); reproducing kernel Hilbert space regression (RKHS); GenPred; shared data resource
Epistasis has long been recognized as an important component in dissecting genetic pathways and understanding the evolution of complex genetic systems (Phillips 2008). It is hence a biologically influential component contributing to the genetic architecture of quantitative traits (Mackay 2014). The influence of epistasis on genome-wide QTL mapping ranges from limited (e.g., Buckler et al. 2009; Tian et al. 2011) to high (e.g., Carlborg et al. 2006; Würschum et al. 2011; Huang et al. 2014). These discrepancies can be explained by the complexities of the examined traits, which are controlled by many loci exhibiting small effects entailing a low QTL detection power. In addition, the estimation of QTL main and interaction effects is very likely biased (Beavis 1994), which makes it challenging to reliably elucidate the role of epistasis through genome-wide QTL mapping studies.
Genomic selection has been suggested as an alternative to tackle complex traits that are regulated by many genes, each exhibiting a small effect (Meuwissen et al. 2001). Genomic selection approaches based on additive and dominance effects have been successfully applied to predict complex traits in human (Yang et al. 2010), animal (Hayes et al. 2009), and plant populations (Jannink et al. 2010; Zhao et al. 2015). Moreover, several genomic selection approaches have been developed to model both main and epistatic effects (Xu 2007; Cai et al. 2011; Wittenburg et al. 2011; Wang et al. 2012). While in some studies prediction accuracies increased (Hu et al. 2011), in others modeling epistasis adversely affected prediction accuracies (Lorenzana and Bernardo 2009).
Despite these first attempts, epistasis is often ignored in genomic selection approaches using parametric models, mainly because of the high associated computational load, especially if a large number of markers are available. An attractive solution to reduce the computational load is to extend genomic best linear unbiased prediction (G-BLUP) models (VanRaden 2008) by adding marker-based epistatic relationship matrices [extended genomic best linear unbiased prediction (EG-BLUP)]. Dating back to Henderson (1985), EG-BLUP enables the estimation of epistatic components of the genotypic values without explicitly assessing individual epistatic effects. Applied to predicting daily gain in pigs and the total height of pine trees, EG-BLUP outperformed G-BLUP (Su et al. 2012; Muñoz et al. 2014). The equivalence between G-BLUP and genomic selection approaches, with explicit relevance for modeling marker main effects, has been demonstrated (Habier et al. 2007). However, the association between EG-BLUP and genomic selection approaches explicitly modeling marker main and interaction effects has not been studied.
Copyright © 2015 by the Genetics Society of America
doi: 10.1534/genetics.115.177907
Manuscript received May 5, 2015; accepted for publication July 23, 2015; published Early Online July 27, 2015.
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177907/-/DC1.
¹Corresponding author: Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Stadt Seeland, Germany. E-mail: [email protected]
Genetics, Vol. 201, 759–768 October 2015 759
The use of semiparametric reproducing kernel Hilbert space (RKHS) regression models has been promoted as an alternative powerful option to capture epistasis in genomic selection (Gianola et al. 2006; Gianola and van Kaam 2008). The RKHS model outperformed linear models that focused exclusively on marker main effects in a number of studies based on simulated data (e.g., Gianola et al. 2006; Howard et al. 2014) and empirical data (e.g., Pérez-Rodríguez et al. 2012; Rutkoski et al. 2012; Crossa et al. 2013). Choosing an appropriate kernel, which can be interpreted as a relationship matrix among genotypes (i.e., individuals), is a central element of model specification in RKHS regression (de los Campos et al. 2010). Among all possible kernels, the Gaussian kernel has been extensively used and is assumed to implicitly portray the genetic effects including epistasis (Gianola and van Kaam 2008; Morota and Gianola 2014). The exponential function involved in the Gaussian kernel is a nonlinear transformation of the additive inputs, which encodes a type of epistasis (Gianola et al. 2014). Nevertheless, it has not been clarified how RKHS regression based on Gaussian kernels explicitly models epistatic effects among different markers.
In this study, we aimed at (1) explaining how the marker-based epistatic relationship matrix used in EG-BLUP models is related to epistatic effects among markers, (2) unraveling how the RKHS model based on a Gaussian kernel takes epistatic effects among different markers into account, and (3) comparing the prediction abilities of three models (G-BLUP, EG-BLUP, and RKHS), using several published experimental data sets.
Theory
Throughout this article, we use the following notations: Let $n$ be the number of genotypes, $m$ be the number of genotypes having phenotypic records, and $p$ be the number of markers. Let $\mathbf{X} = (x_{ij})$ be the $n \times p$ matrix of SNP markers, where $x_{ij}$ equals the number of a chosen allele at the $j$th locus for the $i$th genotype. Let $\mathbf{x}_i$ be the $i$th row of the matrix $\mathbf{X}$, which is the marker profile for the $i$th genotype. Let $p_j$ be the allele frequency of the $j$th marker. We do not necessarily assume Hardy–Weinberg equilibrium in the population.
The G-BLUP model with additive relationship matrix
The baseline model for comparison was the standard G-BLUP model focusing on additive effects,

$$\mathbf{y} = \mathbf{1}_m \mu + \mathbf{Z}\mathbf{g} + \mathbf{e}, \tag{1}$$

where $\mathbf{y}$ refers to the $m$-dimensional vector of phenotypic records, $\mathbf{1}_m$ is an $m$-dimensional vector of ones, $\mu$ is the population mean, $\mathbf{g}$ is an $n$-dimensional vector of additive genotypic values, $\mathbf{Z} = (z_{ij})$ is the corresponding $m \times n$ design matrix allocating phenotypic records to genotypes (i.e., $z_{ij} = 1$ if the $j$th entry of $\mathbf{g}$ corresponds to the $i$th observation in $\mathbf{y}$, and $z_{ij} = 0$ otherwise), and $\mathbf{e}$ is an $m$-dimensional vector of residual terms.
Without loss of generality, we subsequently assume that $m = n$ and that $\mathbf{Z}$ is the identity matrix, leading to the simpler form of the model,

$$\mathbf{y} = \mathbf{1}_n \mu + \mathbf{g} + \mathbf{e}, \tag{2}$$
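where $\mathbf{y}$, $\mathbf{1}_n$, $\mu$, and $\mathbf{e}$ are the same as defined in (1). We assume that $\mu$ is a fixed parameter, and $\mathbf{g}$, $\mathbf{e}$ are random parameters with $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$ and $\mathbf{g} \sim N(\mathbf{0}, \mathbf{G}\sigma_g^2)$. $\mathbf{G}$ denotes the $n \times n$ genomic relationship matrix among all genotypes, calculated following VanRaden (2008) as $\mathbf{G} = \mathbf{W}\mathbf{W}'/\gamma$, where $\gamma = 2\sum_{k=1}^{p} p_k(1 - p_k)$ and $\mathbf{W} = (w_{ij})$ is an $n \times p$ matrix with $w_{ij} = x_{ij} - 2p_j$. It was proved that the matrix $\mathbf{G}$ approaches the well-known numerator relationship matrix $\mathbf{A}$ as the number of markers increases (Habier et al. 2007).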
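The construction of $\mathbf{G}$ can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `vanraden_g`, the toy marker matrix, and the choice to estimate allele frequencies from the sample itself are assumptions made for illustration.

```python
import numpy as np

def vanraden_g(X):
    """Genomic relationship matrix G = W W' / gamma (VanRaden 2008).

    X is an n x p array of allele counts (0, 1, or 2).
    Assumption: allele frequencies p_j are estimated from X itself.
    """
    X = np.asarray(X, dtype=float)
    freq = X.mean(axis=0) / 2.0                  # p_j for each marker
    W = X - 2.0 * freq                           # w_ij = x_ij - 2 p_j
    gamma = 2.0 * np.sum(freq * (1.0 - freq))    # scaling constant
    return W @ W.T / gamma, W, gamma

# Toy data (hypothetical): 6 genotypes, 50 markers.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 50))
G, W, gamma = vanraden_g(X)
```

Because the frequencies are taken from the sample, every column of `W` sums to zero, so `G` is symmetric and has rank of at most $n - 1$.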
EG-BLUP: an extended G-BLUP model comprising additive and additive × additive relationship matrices
Focusing exclusively on additive × additive epistasis, the EG-BLUP model has the form

$$\mathbf{y} = \mathbf{1}_n \mu + \mathbf{g}_1 + \mathbf{g}_2 + \mathbf{e}, \tag{3}$$

where $\mathbf{y}$, $\mathbf{1}_n$, $\mu$, and $\mathbf{e}$ are the same as defined in (2). For each genotype, not only the additive genotypic values but also epistatic genotypic values are included in the model. Namely, $\mathbf{g}_1$ is an $n$-dimensional vector of additive genotypic values, and $\mathbf{g}_2$ is an $n$-dimensional vector of additive × additive epistatic genotypic values. We assume that $\mu$ is a fixed parameter, $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, $\mathbf{g}_1 \sim N(\mathbf{0}, \mathbf{G}\sigma_1^2)$, $\mathbf{g}_2 \sim N(\mathbf{0}, \mathbf{H}\sigma_2^2)$, and $\mathrm{Cov}(\mathbf{g}_1, \mathbf{g}_2) = \mathrm{Cov}(\mathbf{g}_1, \mathbf{e}) = \mathrm{Cov}(\mathbf{g}_2, \mathbf{e}) = \mathbf{0}$. Here the matrix $\mathbf{G}$ is the same as in the G-BLUP model. In an infinitesimal model, Henderson (1985) suggested using the Hadamard product of the additive relationship matrix by itself to obtain the epistatic relationship matrix $\mathbf{H}$. Translated to genomic relationship, this yields

$$\mathbf{H} = \mathbf{G} \# \mathbf{G}. \tag{4}$$

This extended G-BLUP model was recently used by Su et al. (2012) and Muñoz et al. (2014).
When the number of markers is large, we proved that EG-BLUP is equivalent to the model EG-BLUP* with explicit epistatic effects of markers (see the Appendix),

$$\mathbf{y} = \mathbf{1}_n \mu + \sum_{i=1}^{p} \mathbf{W}_i a_i + \sum_{i=1}^{p-1} \sum_{j=i+1}^{p} (\mathbf{W}_i \circ \mathbf{W}_j) v_{ij} + \mathbf{e}, \tag{5}$$

where $\mathbf{y}$, $\mathbf{1}_n$, $\mu$, and $\mathbf{e}$ are the same as before; $\mathbf{W}_i$ is the $i$th column of the matrix $\mathbf{W}$; $a_i$ is the additive effect of the $i$th marker; $\mathbf{W}_i \circ \mathbf{W}_j$ is the element-wise product of the two vectors $\mathbf{W}_i$ and $\mathbf{W}_j$; $v_{ij}$ is the additive × additive epistatic effect of the $i$th and the $j$th marker; and $\mathbf{e}$ is the vector of residual terms. We assume that $\mu$ is a fixed parameter; $a_i \sim N(0, \sigma_1^2/\gamma)$; $v_{ij} \sim N(0, 2\sigma_2^2/\gamma^2)$; $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$; and no covariance among $a_i$, $v_{ij}$, and $\mathbf{e}$. The basic setting of EG-BLUP* in Equation 5 appeared in Wittenburg et al. (2011) with different assumptions on the parameters.
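The link between the pairwise terms of Equation 5 and the Hadamard product of Equation 4 can be checked numerically. The sketch below is an illustration on hypothetical toy data, not the Appendix proof: it builds the matrix of all pairwise products $\mathbf{W}_i \circ \mathbf{W}_j$ (called `U` here) and compares $2\mathbf{U}\mathbf{U}'/\gamma^2$ with $\mathbf{G} \# \mathbf{G}$. Expanding the square of $\sum_k w_{ik} w_{jk}$ shows that the two differ exactly by the same-marker terms $\sum_k w_{ik}^2 w_{jk}^2/\gamma^2$, which is presumably the part that becomes negligible as $p$ grows.

```python
import numpy as np
from itertools import combinations

# Toy data (hypothetical): 5 genotypes, 40 markers coded 0/1/2.
rng = np.random.default_rng(1)
n, p = 5, 40
X = rng.integers(0, 3, size=(n, p)).astype(float)
freq = X.mean(axis=0) / 2.0
W = X - 2.0 * freq
gamma = 2.0 * np.sum(freq * (1.0 - freq))

G = W @ W.T / gamma
H = G * G                                   # Hadamard product G # G

# Design columns W_i ∘ W_j for all marker pairs i < j (the v_ij terms).
U = np.column_stack([W[:, i] * W[:, j] for i, j in combinations(range(p), 2)])

pair_cov = 2.0 * U @ U.T / gamma**2         # covariance implied by the v_ij terms
same_marker = (W**2) @ (W**2).T / gamma**2  # terms with i == j, absent from Equation 5
gap = H - pair_cov
```

In this toy setting `gap` equals `same_marker` up to floating-point error, so the explicit-effects covariance and the Hadamard square agree except for the same-marker contribution.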
Note that the parameters in EG-BLUP* should be considered in the framework of Fisher (1918). Namely, $\mu$ is the population mean, $a_i$ is the average effect of an allele for the $i$th locus, defined as the regression coefficient of the genotypic values on the number of the allele, and $v_{ij}$ ($i \neq j$) is the epistatic deviation for the $i$th and the $j$th loci.
The extension of Equation 3 to include also higher-order additive × additive genotypic values can be deduced using the same method as in Henderson (1985). We need only to note that the $(k - 1)$th-order epistatic relationship matrix is given by $\mathbf{G}^{\# k} = \mathbf{G} \# \mathbf{G} \# \cdots \# \mathbf{G}$ (the Hadamard product of $k$ copies of $\mathbf{G}$).
The RKHS regression model based on a Gaussian kernel
We consider the following model that is equivalent to RKHS regression (de los Campos et al. 2010):

$$\mathbf{y} = \mathbf{1}_n \mu + \mathbf{g} + \mathbf{e}. \tag{6}$$

The notations are the same as in (2) and the assumptions are $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, $\mathbf{g} \sim N(\mathbf{0}, \mathbf{K}\sigma_g^2)$, where $\mathbf{K} = (k(\mathbf{x}_i, \mathbf{x}_j))$ is an $n \times n$ kernel matrix whose entries are functions of marker profiles of pairs of genotypes. It is required that $\mathbf{K}$ be positive semidefinite, that is, $\sum_{i,j} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$ for all real numbers $\alpha_i, \alpha_j$. Mathematically, a number of matrices would satisfy this property. For example, we may choose $\mathbf{K} = \mathbf{G}$, whereby the RKHS model is equivalent to G-BLUP.
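For a given kernel and given variance components, predictions under model (6) follow from the standard BLUP formula $\hat{\mathbf{g}} = \sigma_g^2 \mathbf{K}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{1}_n\hat{\mu})$ with $\mathbf{V} = \mathbf{K}\sigma_g^2 + \mathbf{I}\sigma_e^2$ and $\hat{\mu}$ the generalized least-squares estimate. The following is a minimal sketch under the assumption that the variance components are known; this section does not state how they were estimated, and the function name `blup` is hypothetical.

```python
import numpy as np

def blup(y, K, var_g, var_e):
    """BLUP of g in y = 1*mu + g + e, with g ~ N(0, K var_g), e ~ N(0, I var_e).

    Assumption: var_g and var_e are treated as known.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    V = var_g * K + var_e * np.eye(n)            # var(y)
    ones = np.ones(n)
    # Generalized least-squares estimate of the fixed mean mu.
    mu_hat = (ones @ np.linalg.solve(V, y)) / (ones @ np.linalg.solve(V, ones))
    # Predicted genotypic values.
    g_hat = var_g * K @ np.linalg.solve(V, y - mu_hat * ones)
    return mu_hat, g_hat

# Sanity check with unrelated genotypes (K = I) and equal variances:
# each deviation from the mean is shrunk by one half.
mu_hat, g_hat = blup([1.0, 2.0, 3.0, 6.0], np.eye(4), var_g=1.0, var_e=1.0)
```

Setting `K` to $\mathbf{G}$ gives G-BLUP, while a Gaussian kernel gives the RKHS model considered here; only the covariance structure changes.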
In this study, we consider only the Gaussian kernel (Gianola and van Kaam 2008),

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left[-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{h}\right], \tag{7}$$

where $\|\cdot\|$ denotes the norm in the Euclidean space and $h$ is a bandwidth parameter. As the matrix $\mathbf{K}$ serves as a genetic relationship matrix among genotypes, the parameter $h$ controls how fast the relationship between two genotypes decays as the distance between the corresponding pairs of marker vectors increases. The choice of the bandwidth parameter can be optimized by applying a cross-validation or a Bayesian approach, treating $h$ as a random variable. Throughout this study, we assume that $h$ is known.
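Equation 7 translates directly into a few lines of NumPy. This is a sketch for illustration only: the function name `gaussian_kernel`, the toy marker matrix, and the bandwidth $h = p$ are assumptions, not values taken from the study.

```python
import numpy as np

def gaussian_kernel(X, h):
    """Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / h) for the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X**2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dist = np.maximum(sq_dist, 0.0)     # guard against tiny negative round-off
    return np.exp(-sq_dist / h)

# Toy data (hypothetical): 5 genotypes, 20 markers, bandwidth h = p.
rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(5, 20))
K = gaussian_kernel(X, h=20.0)
```

A larger `h` makes the off-diagonal entries approach 1 (all genotypes look related), whereas a small `h` drives them toward 0.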
An explicit explanation of why the RKHS model captures epistasis
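We start by inspecting the kernel matrix (7) in more detail. Recall that the entries in $\mathbf{W}$ are defined as $w_{ij} = x_{ij} - 2p_j$. Hence we have

$$\|\mathbf{x}_i - \mathbf{x}_j\|^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 = \sum_{k=1}^{p} (w_{ik} - w_{jk})^2 = \sum_{k=1}^{p} w_{ik}^2 + \sum_{k=1}^{p} w_{jk}^2 - 2\sum_{k=1}^{p} w_{ik} w_{jk}.$$

Recall that $\mathbf{G} = \mathbf{W}\mathbf{W}'/\gamma$. Thus the $(i, j)$th entry of $\mathbf{G}$ is $G_{ij} = \sum_{k=1}^{p} w_{ik} w_{jk}/\gamma$. Write $b_l = \sum_{k=1}^{p} w_{lk}^2$, for all $1 \leq l \leq n$. Then we obtain

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{b_i}{h}\right) \exp\left(-\frac{b_j}{h}\right) \exp\left(\frac{2\gamma G_{ij}}{h}\right).$$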
Let $\mathbf{1}_{n \times n}$ be the $n \times n$ matrix of ones and let $\boldsymbol{\Lambda} = \mathrm{diag}(\exp(-b_1/h), \ldots, \exp(-b_n/h))$. Note that in terms of power series, $\exp(x) = 1 + \sum_{k=1}^{\infty} x^k/k!$ (Levi 1968). Rewriting the above steps in matrix form, we have

$$\mathbf{K} = \boldsymbol{\Lambda}\tilde{\mathbf{H}}\boldsymbol{\Lambda}, \tag{8}$$

where

$$\tilde{\mathbf{H}} = \mathbf{1}_{n \times n} + \sum_{k=1}^{\infty} \frac{(2\gamma)^k}{h^k k!} \mathbf{G}^{\# k}. \tag{9}$$

Therefore, we can see that the epistatic relationship matrices $\mathbf{G}^{\# k}$ (for each $k \geq 2$) used in EG-BLUP are all involved in the Gaussian kernel for the RKHS model. In this sense, the Gaussian kernel indeed carries the information of additive × additive epistasis up to any order. But note that in the Gaussian kernel, the proportions of the additive and each epistatic relationship matrix $\mathbf{G}^{\# k}$ in the total matrix $\tilde{\mathbf{H}}$ are fixed, once the bandwidth parameter is chosen. In contrast, in EG-BLUP, the proportion of $\mathbf{G}^{\# k}$ in $\mathbf{H}$ depends on the corresponding variance component, which is an unknown parameter to be estimated.
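The factorization in Equations 8 and 9 can be verified numerically by truncating the power series. The sketch below uses hypothetical toy data, allele frequencies estimated from the sample, a bandwidth $h = p$, and 60 series terms; all of these are illustrative assumptions.

```python
import numpy as np

# Toy data (hypothetical): 6 genotypes, 30 markers, bandwidth h = p.
rng = np.random.default_rng(3)
n, p, h = 6, 30, 30.0
X = rng.integers(0, 3, size=(n, p)).astype(float)
freq = X.mean(axis=0) / 2.0
W = X - 2.0 * freq
gamma = 2.0 * np.sum(freq * (1.0 - freq))
G = W @ W.T / gamma

# Gaussian kernel computed directly from the marker profiles (Equation 7).
diff = X[:, None, :] - X[None, :, :]
K = np.exp(-np.sum(diff**2, axis=2) / h)

# Lambda = diag(exp(-b_l / h)) with b_l = sum_k w_lk^2.
b = np.sum(W**2, axis=1)
Lam = np.diag(np.exp(-b / h))

# H_tilde = 1 + sum_{k>=1} (2 gamma)^k / (h^k k!) G^{#k}, truncated after 59 terms.
H_tilde = np.ones((n, n))
term = np.ones((n, n))
for k in range(1, 60):
    term = term * (2.0 * gamma / h) * G / k   # element-wise: (2 gamma / h)^k G^{#k} / k!
    H_tilde += term

K_series = Lam @ H_tilde @ Lam
```

For this toy example `K_series` reproduces `K` to floating-point accuracy. The weights $(2\gamma)^k/(h^k k!)$ depend only on $h$, which is the sense in which the contributions of the different epistatic orders are fixed once the bandwidth is chosen.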
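Based on the above observations, we can actually formulate a model with explicit epistatic effects of markers and prove that it is equivalent to the RKHS model with the Gaussian kernel. Let us consider the following model, which seems to be ill-posed as infinitely many unknown parameters are included. But we immediately show that it is equivalent to the RKHS model with Gaussian kernel,

$$\mathbf{y} = \mathbf{1}_n \mu + \boldsymbol{\Lambda}\mathbf{1}_n \nu + \sum_{i=1}^{p} \boldsymbol{\Lambda}\mathbf{W}_i a_i + \sum_{s=2}^{\infty} \sum_{1 \leq i_1 < i_2 < \cdots < i_s \leq p} \boldsymbol{\Lambda} \left( \prod_{t=1}^{s} \mathbf{W}_{i_t} \right) v_{i_1 i_2 \cdots i_s} + \mathbf{e}, \tag{10}$$

where the notations $\mathbf{y}$, $\mathbf{1}_n$, $\mu$, $\mathbf{W}_i$, $a_i$, and $\mathbf{e}$ are the same as in (5). $\prod_{t=1}^{s} \mathbf{W}_{i_t}$ is the element-wise product of the vectors $\mathbf{W}_{i_t}$ for $1 \leq t \leq s$. $v_{i_1 i_2 \cdots i_s}$ are the $s$th-order epistatic effects among the $i_1, i_2, \ldots$, and the $i_s$ loci. We assume that $\mu$ is fixed, $\nu$ is an extra random intercept term with $\nu \sim N(0, \sigma_0^2)$, $a_i \sim N(0, (2/h)\sigma_0^2)$, $v_{i_1 i_2 \cdots i_s} \sim N(0, (2^s/h^s)\sigma_0^2)$, $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, and there is no covariance among $\nu$, $a_i$, $v_{i_1 i_2 \cdots i_s}$, and $\mathbf{e}$.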
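Now, let $\mathbf{a}$ be the $p$-dimensional vector $(a_i)_{1 \leq i \leq p}$, $\mathbf{v}^{(s)}$ be the $\binom{p}{s}$-dimensional vector $(v_{i_1 i_2 \cdots i_s})_{1 \leq i_1 < i_2 < \cdots < i_s \leq p}$, and $\mathbf{U}^{(s)}$ be the $n \times \binom{p}{s}$ matrix whose columns consist of the vectors $\prod_{t=1}^{s} \mathbf{W}_{i_t}$ for all $1 \leq i_1 < i_2 < \cdots < i_s \leq p$. Here $\binom{p}{s} = p(p - 1) \cdots (p - s + 1)/s!$ denotes the binomial coefficient. With the above notations, Equation 10 can be rewritten in matrix form as

$$\mathbf{y} = \mathbf{1}_n \mu + \boldsymbol{\Lambda}\mathbf{1}_n \nu + \boldsymbol{\Lambda}\mathbf{W}\mathbf{a} + \sum_{s=2}^{\infty} \boldsymbol{\Lambda}\mathbf{U}^{(s)}\mathbf{v}^{(s)} + \mathbf{e}, \tag{11}$$

with assumptions $\nu \sim N(0, \sigma_0^2)$, $\mathbf{a} \sim N(\mathbf{0}, (2/h)\mathbf{I}\sigma_0^2)$, $\mathbf{v}^{(s)} \sim N(\mathbf{0}, (2^s/h^s)\mathbf{I}\sigma_0^2)$, and $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, and all covariance terms are zero. Then we have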
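
$$\mathbf{V} = \mathrm{var}(\mathbf{y}) = (\boldsymbol{\Lambda}\mathbf{1}_{n \times n}\boldsymbol{\Lambda})\sigma_0^2 + \frac{2}{h}\boldsymbol{\Lambda}\mathbf{W}\mathbf{W}'\boldsymbol{\Lambda}\sigma_0^2 + \sum_{s=2}^{\infty} \frac{2^s}{h^s}\boldsymbol{\Lambda}\mathbf{U}^{(s)}\mathbf{U}^{(s)\prime}\boldsymbol{\Lambda}\sigma_0^2 + \mathbf{I}\sigma_e^2.$$

Recall that $\mathbf{G} = \mathbf{W}\mathbf{W}'/\gamma$. We need to calculate $\mathbf{U}^{(s)}\mathbf{U}^{(s)\prime}$ for any $s \geq 2$. Note that in the case of $s = 2$, we have shown in the Appendix that $\lim_{p \to \infty} 2\mathbf{U}^{(2)}\mathbf{U}^{(2)\prime}/\gamma^2 = \lim_{p \to \infty} \mathbf{G} \# \mathbf{G}$. This result can be easily generalized for $s > 2$, using the same method. That is, for any $s \geq 2$, we have

$$\lim_{p \to \infty} \frac{s!\,\mathbf{U}^{(s)}\mathbf{U}^{(s)\prime}}{\gamma^s} = \lim_{p \to \infty} \mathbf{G}^{\# s}.$$

Thus, when $p$ is very large, we can approximately treat

$$\mathbf{U}^{(s)}\mathbf{U}^{(s)\prime} \approx \frac{\gamma^s}{s!}\mathbf{G}^{\# s}.$$

Then we can deduce that

$$\mathbf{V} = \mathrm{var}(\mathbf{y}) \approx \boldsymbol{\Lambda}\left(\mathbf{1}_{n \times n} + \sum_{s=1}^{\infty} \frac{(2\gamma)^s}{h^s s!}\mathbf{G}^{\# s}\right)\boldsymbol{\Lambda}\sigma_0^2 + \mathbf{I}\sigma_e^2 = (\boldsymbol{\Lambda}\tilde{\mathbf{H}}\boldsymbol{\Lambda})\sigma_0^2 + \mathbf{I}\sigma_e^2.$$

Note that the matrix $\boldsymbol{\Lambda}\tilde{\mathbf{H}}\boldsymbol{\Lambda}$ is exactly the Gaussian kernel $\mathbf{K}$ (Equation 8) and that the variance–covariance matrix $\mathbf{V} = \mathrm{var}(\mathbf{y})$ is exactly the same as in the RKHS model with Gaussian kernel.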
Using the same approach as in the Appendix, it is straightforward to deduce that the modified RKHS (Equation 11) and the RKHS models give the same predictions for the total genotypic values. Thus, we have given a complete explanation of why the RKHS model takes epistasis into account.
Comparing G-BLUP, EG-BLUP, and RKHS, using experimental data
We used two published data sets each in wheat and maize for our study. The first data set consisted of 599 wheat lines genotyped by 1447 diversity array technology (DArT) markers (Crossa et al. 2010). The second data set comprised 254 advanced wheat breeding lines genotyped by 1576 single-nucleotide polymorphism (SNP) markers (Poland et al. 2012). The third data set consisted of 300 maize lines with 1148 SNP markers (Crossa et al. 2010). The fourth data set comprised two large half-sib maize panels from the flint and dent heterotic pools (Bauer et al. 2013). The dent (flint) panel consists of 847 (833) lines with 31,498 (29,466) SNPs. The phenotypic trait on which we focused in this study was grain yield. More details on the data sets are provided in supporting information, File S1.
Using the four data sets, we tested whether the prediction accuracy can be increased by modeling epistasis. To this end, we estimated the prediction accuracy based on the G-BLUP, EG-BLUP, and RKHS models, applying fivefold cross-validations. The prediction accuracy was measured as the Pearson product-moment correlation between predicted and observed genotypic values of the individuals in the test set (more details on methods are included in File S1). We observed that the performance of RKHS was very similar to that of EG-BLUP (Table 1), which fits well with our theoretical findings on the congruency of both models. For the two reanalyzed maize data sets, EG-BLUP and RKHS including epistasis did not outperform G-BLUP ignoring epistasis. In contrast, in the two reanalyzed wheat data sets, we observed that the prediction accuracies for RKHS and EG-BLUP were consistently higher than that for the G-BLUP model.
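The cross-validation scheme can be sketched as follows. The actual procedure is described in File S1, which is not reproduced here, so this is only an illustration under stated assumptions: variance components are treated as known, the fold assignment is a simple random split, the data are simulated toy data, and the function name `cv_accuracy` is hypothetical. The toy example correlates predictions with simulated phenotypes, whereas the study correlates them with observed genotypic values.

```python
import numpy as np

def cv_accuracy(y, K, var_g, var_e, n_folds=5, seed=0):
    """Pearson correlation between predicted and observed values in each test fold.

    Assumption: var_g and var_e are known; K is any relationship/kernel matrix.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    accuracies = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        V = var_g * K[np.ix_(train, train)] + var_e * np.eye(len(train))
        ones = np.ones(len(train))
        mu_hat = (ones @ np.linalg.solve(V, y[train])) / (ones @ np.linalg.solve(V, ones))
        alpha = np.linalg.solve(V, y[train] - mu_hat)
        # Predict test genotypes through their relationship to the training set.
        pred = mu_hat + var_g * K[np.ix_(test, train)] @ alpha
        accuracies.append(np.corrcoef(pred, y[test])[0, 1])
    return np.array(accuracies)

# Toy data (hypothetical): 60 genotypes, 200 markers, purely additive trait.
rng = np.random.default_rng(4)
n, p = 60, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)
freq = X.mean(axis=0) / 2.0
W = X - 2.0 * freq
G = W @ W.T / (2.0 * np.sum(freq * (1.0 - freq)))
y = W @ rng.normal(scale=0.1, size=p) + rng.normal(size=n)
acc = cv_accuracy(y, G, var_g=1.0, var_e=1.0)
```

Passing `G`, a weighted sum of `G` and `G * G`, or a Gaussian kernel as `K` would correspond to the G-BLUP, EG-BLUP, and RKHS comparisons, respectively, again only as a sketch.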
Data availability
This study was based on published data sets. Detailed description and the sources of all data sets were provided in File S1.
Discussion
We focused in our study on digenic additive × additive epistatic effects. Extending the EG-BLUP approach toward additive × dominance and dominance × dominance effects or to higher-order epistasis is straightforward (Henderson 1985). It is important to note, however, that based on the framework used to partition the genotypic variance, additive × additive effects are expected to be the prevailing epistatic effects (Fisher 1918; Lynch and Walsh 1998).
EG-BLUP and RKHS are computationally efficient approaches to tackle epistasis in genomic selection
Extending genomic selection models toward epistasis is often hampered by high computational load. We have demonstrated that EG-BLUP is equivalent to genomic selection approaches modeling explicitly epistatic effects (EG-BLUP*, Equation 5). Moreover, RKHS can also be reformulated as a genomic selection model with explicit epistatic effects (modified RKHS, Equation 10). The computational load of EG-BLUP and RKHS mainly depends on the number of genotypes. In contrast, the computational load of EG-BLUP* comprising additive as well as additive × additive epistatic effects depends on the square of the number of markers. Implementing the EG-BLUP and RKHS models for a previously published maize data set (Bauer et al. 2013) with 847 genotypes and 1000 randomly sampled markers is, for instance, up to 130 times faster compared with the corresponding RR-BLUP approach. Consequently, EG-BLUP and RKHS are promising models to routinely integrate epistasis in genomic selection studies.
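The difference in problem size can be made concrete by counting unknowns. The sketch below is simple arithmetic, not a benchmark: it compares the number of random effects in the explicit model of Equation 5 ($p$ additive effects plus $\binom{p}{2}$ pairwise effects) with the length of the genotypic-value vectors in EG-BLUP, for the 847 genotypes and 1000 markers mentioned above. The function name is hypothetical.

```python
from math import comb

def n_explicit_effects(p):
    """Number of marker effects in EG-BLUP*: p additive plus C(p, 2) pairwise effects."""
    return p + comb(p, 2)

n_genotypes, n_markers = 847, 1000
explicit = n_explicit_effects(n_markers)   # grows with the square of p
per_component = n_genotypes                # length of g_1 and of g_2 in EG-BLUP

print(explicit, per_component)             # 500500 847
```

With all 31,498 markers of the dent panel the explicit model would involve `n_explicit_effects(31498)` effects, while the kernel-based formulation still works with vectors of length 847.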
Modeling epistasis improved the prediction accuracy in selfing but not in outcrossing species
We compared the cross-validated prediction accuracies, using the G-BLUP, EG-BLUP, and RKHS models based on four published data sets. Interestingly, we observed contrasting trends for wheat compared with maize on the performance of models including epistasis (EG-BLUP and RKHS) and G-BLUP without considering epistasis. Namely, EG-BLUP and RKHS were superior to G-BLUP for the wheat data sets but not for the maize data sets (Table 1). Hence, our results suggested that modeling additive × additive epistasis can increase the prediction accuracy in genomic selection for selfing but not for outcrossing species. This is in line with recent findings that additive × additive epistasis substantially affects midparent heterosis in the selfing species rice, but contributes only marginally to heterosis in the outcrossing species maize (Garcia et al. 2008). Nevertheless, more experimental data sets are required to examine the role of epistasis in selfing and outcrossing species in more detail. In particular, it seems attractive to study also the role of epistasis involving dominance effects, which entails specific designs such as factorial mating designs (Comstock and Robinson 1952).
In the EG-BLUP model, both the additive and the additive × additive epistatic relationship matrices were derived from molecular markers. If the markers under consideration are in linkage equilibrium (LE), the additive and additive × additive terms in EG-BLUP* are orthogonal in the sense of Cockerham (1954), and hence the estimates of additive and epistatic effects are independent (Álvarez-Castro and Carlborg 2007). However, the assumption of linkage equilibrium may never be true in reality unless only a few loci sparsely distributed on the genome are considered. Hence, we performed a simulation study to investigate whether linkage disequilibrium (LD) among markers, which causes nonorthogonality of the model, has an influence on the performance of EG-BLUP.
Our simulation was based on the first wheat data set [599 wheat lines with 1447 markers (Crossa et al. 2010)] and the dent panel of the second maize data set [847 lines with 31,498 markers (Bauer et al. 2013)]. We simulated two scenarios: (1) markers contributing to the trait are in LE and (2) markers contributing to the trait are in LD. In all cases, both additive and additive × additive epistatic effects were simulated. The heritability was set to be 0.7. Details for the simulation procedure are presented in File S1. We observed that the prediction accuracy of EG-BLUP was consistently higher than that of G-BLUP in both data sets and both scenarios (Figure 1). Hence, we may conclude that LD among markers has low influence on the effectiveness of EG-BLUP vs. G-BLUP.
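A trait with additive and additive × additive effects at a target heritability can be simulated along the following lines. The actual procedure is given in File S1 and is not reproduced here, so everything below is an assumption made for illustration: the function name `simulate_trait`, standard normal effect sizes, random QTL positions (without distinguishing the LE and LD scenarios), and sampling the epistatic pairs among the 4950 pairs of distinct QTL.

```python
import numpy as np
from itertools import combinations

def simulate_trait(W, n_qtl=100, n_pairs=100, h2=0.7, seed=0):
    """Simulate phenotypes with additive and additive x additive QTL effects.

    Assumptions: effects are standard normal; QTL and pairs are chosen at random.
    """
    rng = np.random.default_rng(seed)
    n, p = W.shape
    qtl = rng.choice(p, size=n_qtl, replace=False)
    additive = W[:, qtl] @ rng.normal(size=n_qtl)

    # Pick epistatic pairs among all pairs of distinct QTL.
    all_pairs = list(combinations(qtl, 2))
    chosen = rng.choice(len(all_pairs), size=n_pairs, replace=False)
    epistatic = np.zeros(n)
    for idx in chosen:
        i, j = all_pairs[idx]
        epistatic += W[:, i] * W[:, j] * rng.normal()

    g = additive + epistatic
    # Residual variance chosen so that var(g) / (var(g) + var(e)) = h2.
    var_e = np.var(g) * (1.0 - h2) / h2
    y = g + rng.normal(scale=np.sqrt(var_e), size=n)
    return y, g

# Toy marker data (hypothetical): 50 genotypes, 300 markers.
rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=(50, 300)).astype(float)
W = X - X.mean(axis=0)          # w_ij = x_ij - 2 p_j with sample frequencies
y, g = simulate_trait(W)
```

The simulated `y` could then be passed to G-BLUP and EG-BLUP to compare prediction accuracies over repeated runs, in the spirit of Figure 1.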
Another factor that may affect the performance of EG-BLUP is inbreeding. In Henderson's extended BLUP model (Henderson 1985), the derivation of the epistatic relationship matrix being the Hadamard square of the numerator relationship matrix depends on the assumption of random mating (Cockerham 1954), which may never hold for data from plant breeding. In our study, the marker-derived epistatic relationship matrix in EG-BLUP approximately equals the Hadamard square of the marker-derived additive relationship matrix. This result relies only on the assumption that the marker additive and epistatic effects are independent. Maybe this assumption is more likely to hold in noninbred than in inbred populations. If this is true, the superiority of EG-BLUP over G-BLUP would be more pronounced for noninbred than for inbred populations, provided that epistasis substantially contributed to the trait. An investigation of this problem is interesting but beyond the scope of this study. Nevertheless, our results in both simulation and empirical study indicated that EG-BLUP can be effectively applied to noninbred plant data.

Table 1 Cross-validated prediction accuracies and standard errors of three genomic selection models [genomic best linear unbiased prediction with additive relationship matrix (G-BLUP), extended G-BLUP with additive and additive × additive relationship matrices (EG-BLUP), and reproducing kernel Hilbert space regression based on the Gaussian kernel (RKHS)] in four data sets

| Data set | Trait–environment (e) | G-BLUP | EG-BLUP | RKHS |
|---|---|---|---|---|
| Wheat_1 (a) | GY_E1 | 0.505 ± 0.034 | 0.571 ± 0.029 | 0.576 ± 0.033 |
| Wheat_1 (a) | GY_E2 | 0.493 ± 0.034 | 0.500 ± 0.034 | 0.499 ± 0.034 |
| Wheat_1 (a) | GY_E3 | 0.379 ± 0.041 | 0.421 ± 0.035 | 0.428 ± 0.034 |
| Wheat_1 (a) | GY_E4 | 0.484 ± 0.033 | 0.525 ± 0.029 | 0.526 ± 0.034 |
| Wheat_2 (b) | GY_drought | 0.435 ± 0.058 | 0.445 ± 0.056 | 0.444 ± 0.054 |
| Wheat_2 (b) | GY_irrigated | 0.537 ± 0.046 | 0.550 ± 0.046 | 0.556 ± 0.042 |
| Maize_1 (c) | GY_drought | 0.429 ± 0.044 | 0.440 ± 0.045 | 0.449 ± 0.043 |
| Maize_1 (c) | GY_irrigated | 0.537 ± 0.038 | 0.546 ± 0.037 | 0.544 ± 0.037 |
| Maize_2 (d) dent | DMY | 0.632 ± 0.030 | 0.627 ± 0.031 | 0.619 ± 0.032 |
| Maize_2 (d) flint | DMY | 0.651 ± 0.020 | 0.649 ± 0.021 | 0.643 ± 0.021 |

In the original table, the highest prediction accuracy for each trait in each data set is underlined.
(a) Data set previously described in Crossa et al. (2010); 599 lines and 1447 DArT markers were used.
(b) Data set previously described in Poland et al. (2012); 254 lines and 1576 SNP markers were used.
(c) Data set previously described in Crossa et al. (2010); 264 lines and 1135 SNP markers were used.
(d) Data set previously described in Bauer et al. (2013) and Lehermeier et al. (2014); 847 genotypes and 31,498 SNP markers were used for dent lines and 833 genotypes and 29,466 SNP markers were used for flint lines.
(e) GY, grain yield; DMY, dry matter yield.
Enhancing prediction accuracy across a biparental population through modeling epistasis
Previous studies have shown that prediction accuracy is impaired when performing genomic selection across connected biparental populations (Zhao et al. 2012; Riedelsheimer et al. 2013). This may be explained at least partially by epistatic effects as the genetic relatedness across connected populations may be better exploited by modeling epistasis in addition to additive effects. Again we used a published maize data set (Bauer et al. 2013) and investigated whether the prediction accuracy across connected biparental families can be increased by modeling additive × additive epistasis. In our scenario, genotypic values of the lines in one family were predicted using lines from each of the other families. We compared the mean and maximal prediction accuracies for each family and observed no superiority for EG-BLUP and RKHS (including epistasis) compared with G-BLUP (ignoring epistasis; Figure 2). The sizes of the biparental populations were small, ranging from 17 to 133. This small population size can substantially reduce prediction accuracy exploiting epistasis, as has been shown previously for QTL mapping (Carlborg and Haley 2004). In addition, maize as an outcrossing species is likely to be influenced only little by additive × additive epistasis in contrast to selfing species (Garcia et al. 2008). Therefore, it will be interesting to investigate in future studies whether prediction accuracy across connected biparental populations can be improved, modeling epistasis using large biparental populations in selfing species.
Acknowledgments
We thank Yusheng Zhao and Timothy Sharbel for their valuable comments on the manuscript. We thank the authors in Crossa et al. (2010), Poland et al. (2012), Bauer et al. (2013), and Lehermeier et al. (2014) for making their data sets publicly available. We are grateful to all reviewers and the editor for their helpful comments and suggestions, which greatly improved the manuscript. This study is based on published data sets. The authors have no conflicts of interest to declare.
Literature Cited
Álvarez-Castro, J. M., and Ö. Carlborg, 2007 A unified model forfunctional and statistical epistasis and its application in quanti-tative trait loci analysis. Genetics 176: 1151–1167.
Bauer, E., M. Falque, H. Walter, C. Bauland, C. Camisan et al.,2013 Intraspecific variation of recombination rate in maize.Genome Biol. 14: R103.
Figure 2 Mean and maximal prediction accuracies of maize lines ineach family, using lines in each of the other families in the sameheterotic group (dent or flint) as the estimation set. The predictionaccuracies were evaluated using three different models [genomicbest linear unbiased prediction (G-BLUP), extended G-BLUP with ad-ditive and additive 3 additive relationship matrices (EG-BLUP), andreproducing kernel Hilbert space regression based on the Gaussiankernel (RKHS)].
Figure 1 The distribution of prediction accuracies of genomic bestlinear unbiased prediction (G-BLUP) and extended G-BLUP with addi-tive and additive 3 additive relationship matrices (EG-BLUP) in simu-lated data sets. Phenotypic traits were simulated for two data sets(wheat, 599 lines; maize, 847 dent lines) and two scenarios (LE, 100QTL in linkage equilibrium contributed to the trait; LD, 100 QTL inlinkage disequilibrium contributed to the trait). Among 5050 pairs ofQTL, 100 pairs were randomly chosen as epistatic QTL. The heritabilityof the simulated traits was 0.7. For each scenario the simulation wasrepeated 50 times.
764 Y. Jiang and J. C. Reif
Beavis, W. D., 1994 The power and deceit of QTL experi-ments: lessons from comparative QTL studies, pp. 250–266 in Proceedings of the Forty-Ninth Annual Corn and Sor-ghum Industry Research Conference, Vol. 1994, edited byD. B. Wilkinson. American Seed Trade Association, Wash-ington, DC.
Buckler, E. S., J. B. Holland, P. J. Bradbury, C. B. Acharya, P. J.Brown et al., 2009 The genetic architecture of maize floweringtime. Science 325: 714–718.
Cai, X., A. Huang, and S. Xu, 2011 Fast empirical Bayesian LASSOfor multiple quantitative trait locus mapping. BMC Bioinfor-matics 12: 211.
Carlborg, Ö., and C. S. Haley, 2004 Epistasis: Too often neglectedin complex trait studies? Nat. Rev. Genet. 5: 618–625.
Carlborg, Ö., L. Jacobsson, P. Åhgren, P. Siegel, and L. Andersson,2006 Epistasis and the release of genetic variation duringlong-term selection. Nat. Genet. 38: 418–420.
Cockerham, C. C., 1954 An extension of the concept of partition-ing hereditary variance for analysis of covariances among rela-tives when epistasis is present. Genetics 39: 859–882.
Comstock, R. E., and H. F. Robinson, 1952 Estimation of averagedominance of genes, pp. 494–516 in Heterosis, edited by J. W.Gowen. Iowa State College Press, Ames, IA.
Crossa, J., G. de Los Campos, P. Pérez, D. Gianola, J. Burgueñoet al., 2010 Prediction of genetic values of quantitative traitsin plant breeding using pedigree and molecular markers. Genet-ics 186: 713–724.
Crossa, J., Y. Beyene, S. Kassa, P. Pérez, J. M. Hickey et al.,2013 Genomic prediction in maize breeding populations withgenotyping-by-sequencing. G3 3: 1903–1926.
de Los Campos, G., D. Gianola, G. J. Rosa, K. A. Weigel, and J.Crossa, 2010 Semi-parametric genomic-enabled prediction ofgenetic values using reproducing kernel Hilbert spaces methods.Genet. Res. 92: 295–308.
Fisher, R. A., 1918 The correlation between relatives on the sup-position of Mendelian inheritance. Trans. R. Soc. Edinb. 52:399–433.
Garcia, A. A. F., S. Wang, A. E. Melchinger, and Z. B. Zeng,2008 Quantitative trait loci mapping and the genetic basis ofheterosis in maize and rice. Genetics 180: 1707–1724.
Gianola, D., and J. B. van Kaam, 2008 Reproducing kernel Hilbertspaces regression methods for genomic assisted prediction ofquantitative traits. Genetics 178: 2289–2303.
Gianola, D., R. L. Fernando, and A. Stella, 2006 Genomic-assistedprediction of genetic value with semiparametric procedures. Ge-netics 173: 1761–1776.
Gianola, D., G. Morota, and J. Crossa, 2014 Genome-enabledprediction of complex traits with kernel methods: What havewe learned? in Proceedings of the Tenth World Congressof Genetics Applied to Livestock Production. Vancouver, BC,Canada. Available at: https://asas.org/docs/default-source/wcgalp-proceedings-oral/212_paper_10331_manuscript_1636_0.pdf?sfvrsn=2.
Habier, D., R. L. Fernando, and J. C. M. Dekkers, 2007 The impactof genetic relationship information on genome-assisted breedingvalues. Genetics 177: 2389–2397.
Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard,2009 Genomic selection in dairy cattle: progress and chal-lenges. J. Dairy Sci. 92: 433–443.
Henderson, C. R., 1975 Best linear unbiased estimation and pre-diction under a selection model. Biometrics 31: 423–447.
Henderson, C. R., 1985 Best linear unbiased prediction of non-additive genetic merits. J. Anim. Sci. 60: 111–117.
Howard, R., A. L. Carriquiry, and W. D. Beavis, 2014 Parametricand non-parametric statistical methods for genomic selection oftraits with additive and epistatic genetic architectures. G3 4:1027–1046.
Hu, Z., Y. Li, X. Song, Y. Han, X. Cai et al., 2011 Genomic valueprediction for quantitative traits under the epistatic model. BMCGenet. 12: 15.
Huang, A., S. Xu, and X. Cai, 2014 Whole-genome quantitativetrait locus mapping reveals major role of epistasis on yield ofrice. PLoS One 9: e87330.
Jannink, J. L., A. J. Lorenz, and H. Iwata, 2010 Genomic selectionin plant breeding: from theory to practice. Brief. Funct. Ge-nomics 9: 166–177.
Lehermeier, C., N. Krämer, E. Bauer, C. Bauland, C. Camisan et al., 2014 Usefulness of multiparental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.
Levi, H., 1968 Polynomials, Power Series and Calculus. Van Nostrand, Princeton, NJ.
Lorenzana, R. E., and R. Bernardo, 2009 Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120: 151–161.
Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
Mackay, T. F., 2014 Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat. Rev. Genet. 15: 22–33.
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Morota, G., and D. Gianola, 2014 Kernel-based whole-genome prediction of complex traits: a review. Front. Genet. 5: 363.
Muñoz, P. R., M. F. Resende, S. A. Gezan, M. D. V. Resende, G. de los Campos et al., 2014 Unraveling additive from nonadditive effects using genomic relationship matrices. Genetics 198: 1759–1768.
Pérez-Rodríguez, P., D. Gianola, J. M. González-Camacho, J. Crossa, Y. Manès et al., 2012 Comparison between linear and non-parametric regression models for genome-enabled prediction in wheat. G3 2: 1595–1605.
Phillips, P. C., 2008 Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9: 855–867.
Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu et al., 2012 Genomic selection in wheat breeding using genotyping-by-sequencing. Plant Genome 5: 103–113.
Riedelsheimer, C., J. B. Endelman, M. Stange, M. E. Sorrells, J. L. Jannink et al., 2013 Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503.
Rutkoski, J., J. Benson, Y. Jia, G. Brown-Guedira, J. L. Jannink et al., 2012 Evaluation of genomic prediction methods for fusarium head blight resistance in wheat. Plant Genome 5: 51–61.
Su, G., O. F. Christensen, T. Ostersen, M. Henryon, and M. S. Lund, 2012 Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PLoS One 7: e45293.
Tian, F., P. J. Bradbury, P. J. Brown, H. Hung, Q. Sun et al., 2011 Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat. Genet. 43: 159–162.
VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
Wang, D., I. S. El-Basyoni, P. S. Baenziger, J. Crossa, K. M. Eskridge et al., 2012 Prediction of genetic values of quantitative traits with epistatic effects in plant breeding populations. Heredity 109: 313–319.
Wittenburg, D., N. Melzer, and N. Reinsch, 2011 Including non-additive genetic effects in Bayesian methods for the prediction of genetic values based on genome-wide markers. BMC Genet. 12: 74.
Würschum, T., H. P. Maurer, B. Schulz, J. Möhring, and J. C. Reif, 2011 Genome-wide association mapping reveals epistasis and genetic interaction networks in sugar beet. Theor. Appl. Genet. 123: 109–118.
Xu, S., 2007 An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics 63: 513–521.
Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders et al., 2010 Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.
Zhao, Y., M. Gowda, W. Liu, T. Würschum, H. P. Maurer et al., 2012 Accuracy of genomic selection in European maize elite breeding populations. Theor. Appl. Genet. 124: 769–776.
Zhao, Y., M. F. Mette, and J. C. Reif, 2015 Genomic selection in hybrid breeding. Plant Breed. 134: 1–10.
Communicating editor: F. van Eeuwijk
Appendix: A Proof of the Equivalence Between EG-BLUP and EG-BLUP* When the Number of Markers Is Large
Let us start with the EG-BLUP* model (Equation 5). Let $a$ be the $p$-dimensional vector of the $a_i$'s and $v$ be the $p(p-1)/2$-dimensional vector of the $v_{ij}$'s. Let $U$ be the $n \times p(p-1)/2$ matrix whose columns are given by the vectors $(W_i \circ W_j)$, the element-wise products of columns $i$ and $j$ of $W$ with $i < j$. Then Equation 5 can be simply written as

$$y = 1_n\mu + Wa + Uv + e,$$

with assumptions $a \sim N(0, I(\sigma_1^2/g))$, $v \sim N(0, I(2\sigma_2^2/g^2))$, and $e \sim N(0, I\sigma_e^2)$, and all covariance terms are zero. Then we have

$$V = \mathrm{var}(y) = \frac{WW'}{g}\sigma_1^2 + \frac{2UU'}{g^2}\sigma_2^2 + I\sigma_e^2. \tag{A1}$$
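The matrix $UU'$ is an $n \times n$ matrix whose $(i, j)$ entry is given by

$$\sum_{1 \le k < s \le p} u_{i,ks}\,u_{j,ks} = \sum_{1 \le k < s \le p} w_{ik}w_{is}w_{jk}w_{js} = \frac{1}{2}\left(\sum_{1 \le k,\,s \le p} w_{ik}w_{is}w_{jk}w_{js} - \sum_{k=1}^{p} w_{ik}^2 w_{jk}^2\right)$$

$$= \frac{1}{2}\left[\left(\sum_{k=1}^{p} w_{ik}w_{jk}\right)\left(\sum_{s=1}^{p} w_{is}w_{js}\right) - \sum_{k=1}^{p} w_{ik}^2 w_{jk}^2\right].$$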
Then it is easy to deduce that

$$UU' = \frac{1}{2}\left[(WW') \,\#\, (WW') - (W \,\#\, W)(W \,\#\, W)'\right].$$

Hence we have

$$\frac{2UU'}{g^2} = G \,\#\, G - \frac{(W \,\#\, W)(W \,\#\, W)'}{g^2}.$$
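The identity for $UU'$ can be checked numerically on a small example. The following is a minimal sketch, not part of the original analysis: the sample sizes, the simulated allele frequencies, and all variable names are assumptions made for illustration, and `W * W` stands for the Hadamard product $W \# W$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 8                                  # small illustrative sizes (assumption)
freq = rng.uniform(0.1, 0.9, p)              # hypothetical allele frequencies p_k
X = rng.binomial(2, freq, size=(n, p)).astype(float)  # genotypes x_ik in {0, 1, 2}
W = X - 2.0 * freq                           # w_ik = x_ik - 2 p_k, as in (A3)
g = 2.0 * np.sum(freq * (1.0 - freq))        # g = 2 * sum_k p_k (1 - p_k)

# U has one column W_k * W_s (element-wise) for every marker pair k < s.
pairs = list(itertools.combinations(range(p), 2))
U = np.column_stack([W[:, k] * W[:, s] for k, s in pairs])

WWt = W @ W.T
W2 = W * W                                   # Hadamard product W # W
lhs = U @ U.T
rhs = 0.5 * (WWt * WWt - W2 @ W2.T)          # right-hand side of the identity

G = WWt / g
identity_holds = np.allclose(lhs, rhs)
scaled_holds = np.allclose(2.0 * lhs / g**2, G * G - (W2 @ W2.T) / g**2)
print(identity_holds, scaled_holds)
```

Because the identity is purely algebraic, both checks should hold for any marker matrix, not only for the simulated one used here.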
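Now we claim that

$$\lim_{p \to \infty} \frac{2UU'}{g^2} = \lim_{p \to \infty} G \,\#\, G,$$

which means that when $p$ is very large, we can approximately treat $2UU'/g^2 \approx G \,\#\, G$. For this purpose we need only to prove

$$\lim_{p \to \infty} \frac{(W \,\#\, W)(W \,\#\, W)'}{g^2} = 0. \tag{A2}$$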
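In fact, the $(i, j)$th entry of the matrix $(W \,\#\, W)(W \,\#\, W)'/g^2$ is

$$t_{ij} = \frac{\sum_{k=1}^{p} w_{ik}^2 w_{jk}^2}{4\left(\sum_{k=1}^{p} p_k(1 - p_k)\right)^2} = \frac{\sum_{k=1}^{p} (x_{ik} - 2p_k)^2 (x_{jk} - 2p_k)^2}{4\left(\sum_{k=1}^{p} p_k(1 - p_k)\right)^2}. \tag{A3}$$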
Note that we always exclude monomorphic markers in the analyses. So we can assume that $p_0 < p_k < 1 - p_0$, where $p_0$ is the threshold of minor allele frequency in the quality control (e.g., $p_0 = 0.01$ or $0.05$). Then the numerator of (A3) is a sum of $p$ nonnegative numbers, each belonging to the interval $[0, 16(1 - p_0)^2]$, while the denominator, once the square is expanded, is a sum of $p^2$ positive numbers, each in the interval $[4p_0^2(1 - p_0)^2, 0.25]$. Thus we have
$$0 \le \lim_{p \to \infty} t_{ij} \le \lim_{p \to \infty} \frac{16(1 - p_0)^2\,p}{4p_0^2(1 - p_0)^2\,p^2} = \lim_{p \to \infty} \frac{4}{p_0^2\,p} = 0,$$

which proves (A2). Hence (A1) is simplified to the following:

$$V \approx G\sigma_1^2 + (G \,\#\, G)\sigma_2^2 + I\sigma_e^2.$$
The right-hand side of the above formula is exactly the same as the variance–covariance matrix $\mathrm{var}(y)$ for Equation 3 in EG-BLUP.
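The rate at which the correction term in (A2) vanishes can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than the authors' procedure: genotypes are simulated, the allele frequencies used for centering are the simulated ones drawn from $[p_0, 1 - p_0]$, and `max_abs_t` is a hypothetical helper name. It compares the largest entry $t_{ij}$ with the bound $4/(p_0^2 p)$ derived above.

```python
import numpy as np

def max_abs_t(n, p, p0, rng):
    """Largest |t_ij| of (W#W)(W#W)'/g^2 for simulated genotypes."""
    freq = rng.uniform(p0, 1.0 - p0, p)      # hypothetical p_k in [p0, 1 - p0]
    X = rng.binomial(2, freq, size=(n, p)).astype(float)
    W = X - 2.0 * freq                       # w_ik = x_ik - 2 p_k
    g = 2.0 * np.sum(freq * (1.0 - freq))
    W2 = W * W                               # Hadamard product W # W
    T = (W2 @ W2.T) / g**2
    return float(np.max(np.abs(T)))

rng = np.random.default_rng(1)
p0 = 0.05
results = {}
for p in (100, 1000, 10000):
    results[p] = max_abs_t(20, p, p0, rng)
    bound = 4.0 / (p0**2 * p)                # upper bound from the proof
    print(p, results[p], bound)
```

The bound is loose by construction, so the simulated entries are expected to sit well below it; what matters for the argument is that both shrink at rate $1/p$.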
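By the results of Henderson (1975), the BLUPs of $a$ and $v$ are given by

$$\hat{a} = \frac{\sigma_1^2}{g} W'V^{-1}(y - 1_n\hat{\mu}), \qquad \hat{v} = \frac{2\sigma_2^2}{g^2} U'V^{-1}(y - 1_n\hat{\mu}), \tag{A4}$$

where

$$\hat{\mu} = \frac{1_n'V^{-1}y}{1_n'V^{-1}1_n}. \tag{A5}$$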
On the other hand, the BLUPs of $g_1$ and $g_2$ in the EG-BLUP model are given by

$$\hat{g}_1 = \sigma_1^2\,G\,V^{-1}(y - 1_n\hat{\mu}), \qquad \hat{g}_2 = \sigma_2^2\,(G \,\#\, G)\,V^{-1}(y - 1_n\hat{\mu}), \tag{A6}$$

where $\hat{\mu}$ is the same as in (A5), as we have proved that the matrices $V = \mathrm{var}(y)$ in EG-BLUP and EG-BLUP* are the same. Comparing (A4) and (A6), we see that $\hat{g}_1 = W\hat{a}$ and $\hat{g}_2 = U\hat{v}$, confirming that EG-BLUP and EG-BLUP* give the same predictions.
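The comparison of (A4) and (A6) can be traced numerically. The sketch below makes several assumptions that are not in the original: genotypes and phenotypes are simulated placeholders, the variance components are treated as known, and the epistatic kernel is taken as the exact EG-BLUP* kernel $2UU'/g^2$ (named `K2` here) rather than its large-$p$ limit $G \# G$. Under these assumptions the marker-effect predictions and the genotypic-value predictions coincide exactly.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 40                                # illustrative sizes (assumption)
freq = rng.uniform(0.1, 0.9, p)              # hypothetical allele frequencies
X = rng.binomial(2, freq, size=(n, p)).astype(float)
W = X - 2.0 * freq
g = 2.0 * np.sum(freq * (1.0 - freq))
U = np.column_stack([W[:, k] * W[:, s]
                     for k, s in itertools.combinations(range(p), 2)])

s1, s2, se = 1.0, 0.5, 0.6                   # assumed known variance components
y = rng.normal(size=n)                       # placeholder phenotypes
ones = np.ones(n)

G = W @ W.T / g
K2 = 2.0 * U @ U.T / g**2                    # epistatic kernel implied by EG-BLUP*
V = G * s1 + K2 * s2 + se * np.eye(n)        # (A1)
Vinv = np.linalg.inv(V)
mu_hat = (ones @ Vinv @ y) / (ones @ Vinv @ ones)   # (A5)
r = Vinv @ (y - ones * mu_hat)

a_hat = (s1 / g) * W.T @ r                   # (A4), additive marker effects
v_hat = (2.0 * s2 / g**2) * U.T @ r          # (A4), pairwise epistatic effects
g1_hat = s1 * G @ r                          # (A6), additive genotypic values
g2_hat = s2 * K2 @ r                         # (A6) with K2 in place of G # G

# Largest entry of the correction term separating K2 from G # G.
kernel_gap = float(np.max(np.abs(K2 - G * G)))
print(np.allclose(W @ a_hat, g1_hat), np.allclose(U @ v_hat, g2_hat), kernel_gap)
```

Replacing `K2` by `G * G` turns the exact equality into the approximation discussed in the proof, with `kernel_gap` measuring the size of the discrepancy for the chosen number of markers.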
GENETICS
Supporting Information
www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177907/-/DC1
Modeling Epistasis in Genomic Selection
Yong Jiang and Jochen C. Reif
Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.177907
File S1
Supplementary Information
Data sets
In this study we used two published wheat and two published maize
data sets. The first data set consisted of 599 wheat lines genotyped
by 1,447 Diversity Array Technology (DArT) markers in the CIMMYT
Global Wheat Breeding Program (Crossa et al. 2010). Genotypic and
phenotypic data were downloaded from the corresponding
supplementary materials.
The second data set comprised 254 advanced wheat breeding
lines from the CIMMYT wheat breeding program, genotyped using a
genotyping-by-sequencing approach (Poland et al. 2012). Genotypic
and phenotypic data were downloaded from the corresponding
supplementary materials. The 1,576 single nucleotide polymorphism
(SNP) markers with the lowest missing rate (<0.15%) were selected for
this study. Remaining missing values were imputed based on
marginal allele frequencies.
The third data set consisted of 300 maize lines from the
Drought Tolerance Maize for Africa project of CIMMYT Global Maize
Program genotyped with 1,148 SNP markers (Crossa et al. 2010).
Genotypic and phenotypic data were downloaded from the
corresponding supplementary materials. In this study we focused on
grain yield, which was examined for 264 lines.
The fourth data set comprised two large half-sib maize panels
from the flint and dent heterotic pools generated within the European
PLANT-KBBE CornFed project (Bauer et al. 2013). The dent (flint)
panel consisted of 10 (11) half-sib families with 847 (833) doubled
haploid (DH) lines. Genomic data were downloaded from the website
of National Center for Biotechnology Information (NCBI) Gene
Expression Omnibus as data set GSE50558
(http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50558).
After quality control for missing rate and minor allele frequency, the
number of SNP markers used in this study was 31,498 for dent lines
and 29,466 for flint lines. Field trials were described in Lehermeier et
al. (2014) and the phenotypic data were downloaded from the
corresponding supplementary materials.
Simulation study
The simulation was based on the first wheat data set (599 wheat
lines with 1,447 markers, Crossa et al. 2010) and the dent panel of
the second maize data set (847 lines with 31,498 markers, Bauer et
al. 2013).
For each data set we simulated traits in two scenarios: In the
LE scenario, we randomly selected 100 markers with pairwise LD (r2)
less than 0.06 as the causal QTL contributing to the trait. The
additive effects of the 100 QTL were independently sampled as a
normally distributed random variable with mean 0 and variance 1.
Then, we randomly sampled 100 pairs (among 5,050 pairs) of
markers as causal epistatic QTL pairs. The epistatic effects were
independently sampled as a normally distributed random variable
with mean 0 and variance 0.5. Setting the heritability to be 0.7, we
calculated the variance of environmental errors and the error terms
for each genotype were independently sampled as a normally
distributed random variable. Finally, we obtained the simulated trait
values by summing up the additive values, epistatic values and
environmental errors. In the LD scenario, we randomly sampled
100 markers as causal QTL without considering LD; all other
procedures were the same as in the LE scenario. For each data set
and each scenario, the simulation was repeated 50 times.
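The simulation steps described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in the study: the marker matrix is random 0/1 data rather than the real genotypes, the LD filter on pairwise r2 is omitted (so the sketch corresponds to the LD scenario), epistatic values are assumed to be products of the two marker codes, and causal pairs are drawn among distinct QTL (i < j).

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p, n_qtl, n_pairs, h2 = 200, 1000, 100, 100, 0.7

# Hypothetical marker matrix coded 0/1; the study used real genotypes.
X = rng.binomial(1, 0.5, size=(n, p)).astype(float)

# LD scenario: causal QTL drawn at random, with no filter on pairwise r2.
qtl = rng.choice(p, size=n_qtl, replace=False)
Q = X[:, qtl]

add_eff = rng.normal(0.0, 1.0, n_qtl)                  # additive effects, variance 1
all_pairs = list(itertools.combinations(range(n_qtl), 2))
chosen = rng.choice(len(all_pairs), size=n_pairs, replace=False)
epi_eff = rng.normal(0.0, np.sqrt(0.5), n_pairs)       # epistatic effects, variance 0.5

additive = Q @ add_eff
epistatic = np.zeros(n)
for idx, eff in zip(chosen, epi_eff):
    i, j = all_pairs[idx]
    epistatic += eff * Q[:, i] * Q[:, j]               # assumed product coding

genetic = additive + epistatic
var_g = genetic.var(ddof=1)
var_e = var_g * (1.0 - h2) / h2                        # gives var_g / (var_g + var_e) = h2
y = genetic + rng.normal(0.0, np.sqrt(var_e), n)
```

Solving the heritability definition for the error variance gives var_e = var_g (1 - h2) / h2, which is the only step in the sketch that is fixed by the text rather than by an assumption about coding or sampling.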
Evaluating prediction accuracies
The prediction accuracies of the three genomic prediction models
were evaluated by five-fold cross-validation with 20 replications. For
experimental data sets, the Pearson product-moment correlation
between predicted and observed total genotypic values of the
individuals in the test set was used as the measure of prediction
accuracy. For simulated data sets, the prediction accuracy was
defined as the correlation between predicted and true genotypic
values of the individuals in the test set. Standard errors of prediction
accuracies were estimated based on a bootstrap approach following
Rutkoski et al. (2012). All models were implemented using the R
package BGLR (Pérez and de los Campos 2014).
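The cross-validation scheme can be outlined in a few lines. The sketch below is not the BGLR-based implementation used in the study: it is written in Python, the predictor `gblup_predict` is a hypothetical kernel-ridge stand-in with a fixed variance ratio `lam`, and the toy data are simulated. It only illustrates five-fold cross-validation with 20 replications and the Pearson correlation in each test set.

```python
import numpy as np

def gblup_predict(K, y, train, test, lam=1.0):
    """Kernel-ridge stand-in for a BLUP model with fixed variance ratio lam (assumption)."""
    mu = y[train].mean()
    Ktt = K[np.ix_(train, train)]
    alpha = np.linalg.solve(Ktt + lam * np.eye(len(train)), y[train] - mu)
    return mu + K[np.ix_(test, train)] @ alpha

def cross_validate(K, y, n_folds=5, n_reps=20, seed=0):
    """Return one test-set correlation per fold and replication."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accuracies = []
    for _ in range(n_reps):
        folds = np.array_split(rng.permutation(n), n_folds)
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
            pred = gblup_predict(K, y, train, test)
            accuracies.append(np.corrcoef(pred, y[test])[0, 1])
    return np.array(accuracies)

# Toy data (hypothetical): 100 lines, 300 markers, purely additive trait.
rng = np.random.default_rng(4)
W = rng.normal(size=(100, 300))
y = W @ rng.normal(size=300) + rng.normal(scale=5.0, size=100)
K = W @ W.T / W.shape[1]
acc = cross_validate(K, y)
print(acc.mean())
```

Passing an epistatic kernel such as `K * K` instead of `K` would mimic the comparison between additive and epistatic models, again only as a rough analogue of the models fitted in the study.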
Supplementary References
Bauer, E., M. Falque, H. Walter, C. Bauland, C. Camisan et al., 2013
Intraspecific variation of recombination rate in maize. Genome
Biol. 14: R103.
Crossa, J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño et
al., 2010 Prediction of genetic values of quantitative traits in
plant breeding using pedigree and molecular markers.
Genetics 186: 713-724.
Lehermeier, C., N. Krämer, E. Bauer, C. Bauland, C. Camisan et al.,
2014 Usefulness of Multiparental Populations of Maize (Zea
mays L.) for Genome-Based Prediction. Genetics 198: 3-16.
Pérez, P., and G. de los Campos, 2014 Genome-wide regression &
prediction with the BGLR statistical package. Genetics 198:
483-495.
Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu et al., 2012
Genomic selection in wheat breeding using genotyping-by-
sequencing. Plant Gen. 5: 103-113.
Rutkoski, J., J. Benson, Y. Jia, G. Brown-Guedira, J. L. Jannink et
al., 2012 Evaluation of genomic prediction methods for
fusarium head blight resistance in wheat. Plant Gen. 5: 51-61.