Analysis of Genotyping-by-Sequencing data in a maize (Zea mays L.)
breeding program
by
Alison R. Cooke
A Thesis
presented to
The University of Guelph
In partial fulfilment of requirements
for the degree of
Master of Science
in
Bioinformatics
© Alison R. Cooke, December, 2016
ABSTRACT
ANALYSIS OF GENOTYPING-BY-SEQUENCING DATA IN A MAIZE (ZEA MAYS L.)
BREEDING PROGRAM
Alison R. Cooke
University of Guelph, 2016
Advisors: Dr. Andy Robinson, Dr. Elizabeth A. Lee
As genotyping-by-sequencing data becomes more abundant, new applications of this
technology in breeding programs are possible. This research utilized publicly available data in
conjunction with data from the University of Guelph maize breeding program. Using a genetic
similarity matrix, a network diagram and dendrogram were generated, reflecting the maize
heterotic patterns. Hierarchical clustering was used to identify putative parents of University of
Guelph lines from a publicly available dataset. For lines derived from inter-heterotic breeding
crosses, parental linkage blocks were identified and visualized across the chromosomes. The
marker data was also used in conjunction with grain yield data for association mapping using a
mixed model approach. This method identified 123 quantitative trait loci for additive effects for
grain yield in Stiff Stalk lines, accounting for approximately 9.38% of the phenotypic variance.
Breeders can increase the utility of their marker data in their breeding program by applying the
methods described here.
ACKNOWLEDGEMENTS
I would first like to thank my advisors, Dr. Liz Lee and Dr. Andy Robinson, for creating
this unique thesis project for me, combining the data analysis techniques of animal science with
phenotypic data from Liz’s maize breeding program. I am grateful for their support and guidance
with my research. They were both wonderful advisors.
Thank you to my advisory committee for providing insight from the statistical and plant
breeding perspectives. Special thanks to Dr. Gord Vandervoort, who was invaluable in assisting
me with my data analysis. He is extremely knowledgeable and always willing to offer advice
with a smile on his face.
I would like to thank the Natural Sciences and Engineering Research Council of Canada
(NSERC), and the Highly Qualified Personnel (HQP) scholarship programs for my Masters
scholarships. Finally, a thank you to Pioneer Hi-bred for granting me an internship as part of my
HQP scholarship – I learned a lot and had a great summer!
TABLE OF CONTENTS
Acknowledgements
List of Figures
List of Tables
List of Abbreviations Used
Chapter 1: Introduction
Chapter 2: Literature Review and Research Objectives
2.1 Heterotic patterns in maize breeding
2.1.1 Development of hybrids from open pollinated populations
2.1.2 Privatization of U.S. maize breeding
2.1.3 The role of heterotic patterns in maize breeding
2.1.4 U.S. heterotic patterns and founding inbred lines
2.2 Use of genetic markers to investigate relationships of maize inbreds
2.2.1 SNP data for genetic analysis of maize relationships
2.2.2 Hierarchical clustering with maize inbreds
2.2.3 Network diagram to group related individuals
2.2.4 Identifying parental genomic blocks in maize populations
2.3 Grain yield improvements in maize
2.3.1 Yield improvements in the 20th century
2.3.2 Role of planting density in yield increases
2.3.4 Traditional QTL mapping for grain yield in maize
2.4 In silico association mapping with maize breeding program data
2.4.1 Advantages of in silico association mapping
2.4.2 In silico QTL mapping with maize breeding program data
2.4.3 Mixed models for in silico mapping
2.5 QTL mapping with North Carolina Design II data
2.5.1 Description of North Carolina Design II
2.5.2 QTL mapping with NCII breeding program data
2.6 Research objectives and hypotheses
Chapter 3: Applications of Genotyping-by-Sequencing data in Maize Breeding Programs
3.1 Abstract
3.2 Introduction
3.3 Methods
3.3.1 Germplasm and marker data
3.3.2 Network Diagram
3.3.3 Hierarchical clustering
3.3.4 Identity by descent
3.4 Results and Discussion
3.4.1 Classifying Lines into heterotic patterns using GBS data
3.4.2 Identification of putative parents of lines of unknown parentage using GBS data
3.4.3 Determining genome contribution of the parents from a breeding cross using GBS data
3.5 Conclusions
CHAPTER 4: IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES
4.1 Abstract
4.2 Introduction
4.3 Methods
4.3.1 Germplasm and Field Trials
4.3.2 Molecular marker data
4.3.3 Mixed Model Analysis
4.4 Results and Discussion
4.4.1 Genomic relationship matrices
4.4.2 QTL detection using genomic and pedigree-based matrices
4.4.3 Informative SNPs used for in silico mapping
4.4.4 QTL detection and NCII
4.4.5 QTLs detected in Stiff Stalk
4.4.6 Discussion of the mixed model approach
4.6 Conclusions
CHAPTER 5: GENERAL DISCUSSION
5.1 Applications of genotyping by sequencing data in maize breeding programs
5.2 In silico mapping of maize grain yield QTLs using structured crosses
REFERENCES
SUPPLEMENTAL TABLES AND FIGURES FOR CHAPTER 4: IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES
LIST OF FIGURES
Figure 3.1. Network diagram for 1,081 public and ex-PVP lines and 110 Guelph lines generated
using IBS values greater than 0.3 (A). Points sharing the same colour (red, green,
pink or dark blue) indicate Guelph inbred lines derived from the same pedigree.
Other Guelph lines are coloured light blue. Founder inbred lines of heterotic patterns
are labelled. The germplasm clusters into three main heterotic pools: Stiff Stalk
(B73, B37, B14), Iodent (PH207) and Lancaster (Mo17, Oh43, Wf9). Network
diagram with labels, zoomed in to the region containing B73 (circled) showing the
level of detail generated in the diagram (B).
Figure 3.2. Visual representation of the main division of branches in the dendrogram. The
branches containing tropical, popcorn, and sweet corn are labelled. The remaining
branches of the dendrogram contain U.S. field corn and Guelph lines, with the
locations of key founder individuals indicated.
Figure 3.3. Subset of the dendrogram showing closest neighbours of PHJ40 (A) and PHR25 (B),
located on branches that diverge in the third split of the dendrogram, when
University of Guelph lines are not included in the dataset. When progeny from
PHJ40 x PHR25 are included in the dataset, these lines cluster together, with their
progeny (C).
Figure 3.4. Histogram of the proportion of SNP alleles unaccounted for by either parent when
inbred lines and parents are randomly sampled from the dataset (n=10,000).
Figure 3.5. Clustering of five siblings places all five lines with PHJ40 as the putative parent (A).
Removing SNPs with identical alleles between the lines and PHJ40 places each line
with PHG72, showing CG64 as an example (B).
Figure 3.6. Clustering of five siblings places four of the siblings together with LH85 as the
putative parent (A). For these four siblings, removing SNPs with identical alleles to
LH85 places each line with PHJ40, showing CG70 as an example (B). The fifth
sibling, CG42, clusters elsewhere (Figure 3.7).
Figure 3.7. Initial clustering places CG42 with Tx714 and N201 (A). Removing SNPs with
identical alleles to N201 (B) or Tx714 (C) places CG42 with NC278 as the second
putative parent.
Figure 3.8. Clustering of five siblings did not reveal a clear putative parent. When each sibling
was clustered individually, two of the siblings, CG57 (A) and CG65 (B), clustered
with SD102, with differing surrounding ex-PVP lines. The other three siblings,
CG37 (C), CG38 (D) and CG79 (E), clustered with the same set of ex-PVP lines,
which differs from the clustering in (A) and (B).
Figure 3.9. Parental linkage block distribution across the seven inbreds derived from the two-way
breeding cross, PHJ40 (Stiff Stalk, blue) and PHR25 (Iodent, red). Of the SNP
alleles that could be assigned to a parent, the percentage of SNP alleles inherited
from the Iodent parent is shown.
Figure 3.10. Parental linkage block distribution across the three inbreds derived from the three-way
breeding cross. Linkage blocks in blue were inherited from CG102,
red from CG33, and green from CG65.
Figure 3.11. The expected and observed percentage of parental contribution in the 3-way cross
derived inbred lines based on SNP alleles that could be assigned to a parent. Parental
lines: CG102 (blue), CG65 (green) and CG33 (red).
Figure 3.12. Histogram of sizes of parental linkage blocks, with the frequency of blocks in each
bin expressed as a percentage of the total number of linkage blocks for the inbred
line.
Figure 3.13. Ideograms of SNP alleles shared by all seven siblings from the Stiff Stalk x Iodent
cross (A) and all three siblings in the 3-way cross (B). Regions in blue represent
PHJ40 and red represent PHR25 (A); regions in blue represent CG102, red CG33,
and green CG65 (B). Regions of dense SNP markers are indicated by green boxes
(A).
Figure 4.1. Scatter plot of Likelihood Ratio Test statistic (LRT) values for Stiff Stalk markers for
additive effects for grain yield using the G matrix (top) and pedigree matrix
(bottom). Lines indicate FDR adjusted q-value < 0.05.
Figure 4.2. Ideogram illustrating the genome coverage of the SNP markers used for in silico
mapping. Black bands show the position of (A) 27,398 SNPs in Stiff Stalk inbred
lines and (B) 32,401 SNPs in Iodent inbred lines.
Figure 4.3. Ideogram showing the chromosomal locations of the SNPs that are significantly
associated with regions influencing grain yield in the Stiff Stalk inbred lines.
LIST OF TABLES
Table 3.1. Pedigrees and families of Guelph inbred lines and Ex-PVP lines used to generate GBS
data. Pedigree information from E. Lee (unpublished data) and U.S. plant variety
protection certificates.
Table 3.2. The percentage of SNPs unaccounted for by either parent for inbred lines of known
parentage.
Table 3.3. Number and size of parental linkage blocks identified in 10 University of Guelph
inbred lines derived from breeding crosses.
Table 3.4. IBD regions between the seven siblings of the Stiff Stalk x Iodent cross. These regions
of dense marker coverage were selected based on visual inspection of the ideogram
(Figure 3.13).
Table 4.1. Pedigree of Stiff Stalk and Iodent inbred lines used in the NCII. Pedigree information
from E.A. Lee (unpublished data) and United States plant variety protection
certificates.
Table 4.2. GBS-based relationship matrix for the Stiff Stalk inbred lines.
Table 4.3. GBS-based relationship matrix for the Iodent inbred lines.
Table 4.4. Pedigree-based relationship matrix for the Stiff Stalk inbred lines.
Table 4.5. Pedigree-based relationship matrix for the Iodent inbred lines.
Table 4.6. The levels of LRT values observed in the scatter plot (Figure 2) correspond to
different patterns of alleles across the inbred lines. Level 1 indicates the highest LRT
value. The numbers for the alleles indicate the count of minor alleles (i.e. 0 =
homozygous major allele, 1 = heterozygous, 2 = homozygous minor allele).
Table 4.7. Mapping with the pedigree-based matrix results in more markers with significant
associations than mapping with the genomic relationship matrix. Likelihood ratio test
(LRT) statistics and corresponding adjusted p-value (q-value) for SNPs in the different
LRT levels are shown. Q-values < 0.05 are indicated with a *.
Table 4.8. For markers in the top three LRT levels, homozygosity for the major allele is
associated with high grain yield at commercial (74 k ha⁻¹) and high (148 k ha⁻¹)
population densities. The inbred lines are ordered according to the number of minor
alleles at loci in the top 3 LRT levels (see Table 4.6). Estimates of grain yield,
expressed as gi, are shown for each inbred line for each planting density (Holtrop
2016). These estimates reflect the difference between the average yield of all progeny
of a parental line and the average yield of all hybrids grown in an environment.
LIST OF ABBREVIATIONS USED
ANOVA: analysis of variance
Ex-PVP: expired Plant Variety Protection
GBS: genotyping-by-sequencing
GS: genomic selection
IBD: identical-by-descent / identity-by-descent
IBS: identical-by-state / identity-by-state
LD: linkage disequilibrium
LogL: log likelihood
MAS: marker assisted selection
NCII: North Carolina Design II
PCA: principal component analysis
PCOA: principal coordinate analysis
PVP: Plant Variety Protection
QTL: quantitative trait locus
REML: restricted maximum likelihood
RFLP: restriction fragment length polymorphisms
RIL: recombinant inbred line
SNP: single nucleotide polymorphism
SSR: simple sequence repeats
UPGMA: unweighted pair group method with arithmetic mean
CHAPTER 1: INTRODUCTION
Maize (Zea mays L.) is a globally important crop used for food, animal feed, ethanol
production and many industrial purposes. With over 1 billion tonnes produced in 2013, maize
has the second highest global production of any crop, following sugar cane (FAO 2015). A
method of assessing a large number of SNP markers at low cost per sample, called genotyping-
by-sequencing (GBS), has been recently developed (Elshire et al. 2011). This method generates
large amounts of data per sample, and data for thousands of maize lines is publicly available.
The clustering of maize lines into heterotic groups is a common technique used to understand the
germplasm structure within maize breeding programs. Recently, a new method of clustering,
called a network diagram, has been described (Romay et al. 2013). A comparison of this new
approach to the more common method of clustering, hierarchical clustering, is needed to
determine the merits of each method. Additionally, novel applications of GBS data to maize
breeding programs are needed to increase the potential use of this data.
This thesis demonstrates methods of using publicly available GBS data in conjunction
with data from the University of Guelph maize breeding program. It compares the outcomes of
clustering commercial maize germplasm with the network diagram and hierarchical clustering
approaches, using GBS data for nearly 1,200 maize lines. This comparison clarifies the
differences between the approaches, allowing researchers to select the best method for their
experiment. The thesis then describes novel applications of GBS data, including identifying
close relatives of maize lines from a dataset and visualizing the inheritance of parental linkage
blocks along the chromosomes. Together, the comparison of current methods and the novel
applications of GBS data facilitate a greater understanding of maize germplasm and benefit
breeding programs.
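As an illustration of the two clustering approaches compared in this thesis, the following minimal Python sketch computes a pairwise identity-by-state (IBS) similarity matrix from SNP genotypes coded as minor-allele counts, groups the lines by UPGMA hierarchical clustering, and lists the edges of a network diagram in which two lines are connected when their IBS exceeds a threshold. The function names, the toy genotypes and the 0.5 threshold are assumptions made for illustration only; they are not the software or settings used in this research, and missing genotype calls are not handled.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ibs_matrix(genotypes):
    """Pairwise IBS similarity for genotypes coded as minor-allele counts (0, 1, 2).

    genotypes: (n_lines, n_snps) array with no missing calls (an assumption).
    """
    g = np.asarray(genotypes, dtype=float)
    # Per-SNP similarity between two lines is 1 - |g_i - g_j| / 2,
    # averaged over all SNPs.
    diff = np.abs(g[:, None, :] - g[None, :, :])
    return 1.0 - diff.mean(axis=2) / 2.0


def upgma_groups(ibs, n_groups):
    """Cut a UPGMA (average linkage) tree built on 1 - IBS into n_groups clusters."""
    dist = 1.0 - ibs
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_groups, criterion="maxclust")


def network_edges(ibs, threshold):
    """Pairs of lines (i, j) whose IBS similarity exceeds the threshold."""
    n = ibs.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if ibs[i, j] > threshold]


# Hypothetical toy data: four inbred lines scored at six SNPs.
toy = [[0, 0, 0, 2, 2, 2],
       [0, 0, 0, 2, 2, 0],
       [2, 2, 2, 0, 0, 0],
       [2, 2, 0, 0, 0, 0]]
toy_ibs = ibs_matrix(toy)
toy_groups = upgma_groups(toy_ibs, n_groups=2)   # lines 0-1 and lines 2-3
toy_edges = network_edges(toy_ibs, threshold=0.5)  # [(0, 1), (2, 3)]
```

Both outputs start from the same similarity matrix: the dendrogram forces every line into a nested hierarchy, whereas the network only draws the pairwise relationships that pass the threshold.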
The high marker density of GBS data also has applications in quantitative trait loci (QTL)
mapping. Maize breeders are continually seeking new QTLs for desirable agronomic traits, as
marker data is now an essential component of maize breeding programs (Bernardo and Yu
2007). There is now an opportunity to develop new methods, using the high marker density of
GBS data, to increase the efficiency of QTL discovery and the application of results to breeding
programs. While traditional QTL mapping is performed using linkage mapping, with a
segregating mapping population generated from a bi-parental cross, these results are not always
applicable to elite germplasm. An alternative method is association mapping, which uses
historical linkage disequilibrium to link genomic regions to phenotypes (Zu et al. 2008).
Association mapping can be performed using existing phenotypic data from elite germplasm to
detect QTLs through computational methods, referred to as in silico mapping (Grupe et al.
2001). Using elite germplasm to detect QTLs bridges the gap between research and application,
allowing results to be applied directly using marker assisted selection (Parisseaux and Bernardo
2004; Crepieux et al. 2005). Additionally, using existing data from breeding programs costs
less than conducting a traditional QTL mapping experiment (Parisseaux and
Bernardo 2004).
Previous studies have detected QTLs with in silico association mapping using phenotypic
data from hybrids in a breeding program (Parisseaux and Bernardo 2004; van Eeuwijk et al.
2010). However, these studies generated hybrids by crossing individuals from different heterotic
pools in various combinations, rather than systematically generating hybrids. This thesis
examines the potential of utilizing a structured mating design, North Carolina Design II (NCII)
(Comstock and Robinson 1952), and GBS data for in silico association mapping. The NCII is a
commonly used mating design, in which each member of a group of females is crossed with each
member of another group of males to identify superior inbred lines and parental combinations. The
use of a systematic design facilitates the partitioning of the genotypic variance into effects due to
the female, effects due to the male, and effects due to the interaction of the male and female
using a two-way analysis of variance (ANOVA) (Hallauer et al. 2010a). This thesis explores the
use of NCII ANOVA to serve as a preliminary step to in silico QTL mapping, by ensuring that
sufficient additive genetic variation is present in the germplasm for the phenotype of interest.
This research describes a method of in silico association mapping using GBS data and existing
hybrid data for grain yield from an NCII, consisting of elite short season Stiff Stalk and Iodent
inbred lines. This research also investigates the potential of using GBS data to generate a
genomic relationship matrix, to substitute for the traditionally used pedigree-based additive
relationship matrix. This method of in silico QTL mapping using GBS data allows breeders to
increase the efficiency of their breeding programs by integrating QTL mapping with existing
phenotypic hybrid data from structured crosses, leading to directly applicable marker
associations to use in selection decisions.
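The partitioning described above can be sketched for a balanced NCII as follows. This is a minimal illustration, assuming every female x male hybrid is observed in the same number of replicates with no missing cells; the function name and the dictionary keys are hypothetical, and the sketch computes only the sums of squares of the two-way ANOVA, not the mixed model used for mapping in this thesis.

```python
import numpy as np


def ncii_sums_of_squares(y):
    """Two-way ANOVA sums of squares for a balanced NCII.

    y: array of shape (n_females, n_males, n_reps) with one phenotype value
       (e.g. grain yield) per female x male hybrid per replicate.
    """
    y = np.asarray(y, dtype=float)
    n_f, n_m, n_r = y.shape
    grand = y.mean()
    female_means = y.mean(axis=(1, 2))  # mean of all hybrids of each female
    male_means = y.mean(axis=(0, 2))    # mean of all hybrids of each male
    cell_means = y.mean(axis=2)         # mean of each specific hybrid

    # Main effects of females and males, then the female x male interaction.
    ss_female = n_m * n_r * np.sum((female_means - grand) ** 2)
    ss_male = n_f * n_r * np.sum((male_means - grand) ** 2)
    interaction = (cell_means - female_means[:, None]
                   - male_means[None, :] + grand)
    ss_fxm = n_r * np.sum(interaction ** 2)
    # Variation among replicates of the same hybrid.
    ss_error = np.sum((y - cell_means[:, :, None]) ** 2)
    ss_total = np.sum((y - grand) ** 2)
    return {"female": ss_female, "male": ss_male,
            "female_x_male": ss_fxm, "error": ss_error, "total": ss_total}
```

In a balanced design the four components sum to the total, so large female and male sums of squares relative to the interaction indicate that most of the variation among hybrids is attributable to the main effects of the parental lines.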
CHAPTER 2: LITERATURE REVIEW AND RESEARCH OBJECTIVES
2.1 HETEROTIC PATTERNS IN MAIZE BREEDING
2.1.1 Development of hybrids from open pollinated populations
Maize was domesticated in tropical southern Mexico 7,000 to 8,000 years ago and has
since been cultivated along the North-South axis, adapting to more temperate climates (Troyer
1999). Following European colonization of the United States, there were two main groups of
open pollinated corn. Dent corn was grown in the south and had high yields due to breeding
efforts. Flint corn was grown in the north, was adapted to shorter growing seasons at northern
latitudes and had lower yields than dent corn. Breeders have developed higher yielding varieties
by incorporating traits such as early flowering and cold tolerance from flint corn into higher
yielding dent corn germplasm.
In the United States in 1900, there were an estimated 1,000 open pollinated cultivars that
had been generated from the crossing of Flint x Dent corn groups (Montgomery 1916; Troyer
1999). In open pollinated populations, the crosses were not controlled, but rather the seed
produced from desirable female plants was selected. The concept of hybrid maize was developed
by Shull (1908, 1909) and E.M. East (1909), with the development of inbred lines and hybrids
beginning in the 1920s. Hybrid maize had the advantage of both male and female parent
selection as well as the phenomenon of heterosis, in which the progeny of two unrelated parents
shows superior performance over the parents for biological traits such as stress tolerance and
yield. The initial inbred lines were derived from the self-pollination of virtually all of the 1,000
open pollinated varieties (Troyer 1999; Duvick 2005).
The hybrids developed in the 1920s were double cross hybrids, produced by crossing four
inbred parents, because single cross hybrids did not produce enough seed (Jones 1918). By the
1930s double-cross hybrid performance had exceeded that of open pollinated cultivars and
double-cross hybrids were widely sold in place of open pollinated populations (Crow 1998).
Over the next 30 years, breeding efforts resulted in increased grain yield of inbred lines, which
increased the yields of single cross hybrids (Troyer 1999). The commercialization of single cross
hybrids in the 1960s replaced double cross hybrids (Crow 1998).
2.1.2 Privatization of U.S. maize breeding
Since the 1970s, breeding has become increasingly privatized and the role of public
inbreds in commercial hybrids has continually declined (Darrah and Zuber 1986; Mikel and
Dudley 2006). Private companies began their breeding programs by utilizing the same public
germplasm in the 1920s and 1930s, and subsequently developing new inbred lines by self-
pollinating superior, commercial hybrids from their private breeding programs (Troyer 1999).
Over time, germplasm has been exchanged between breeding programs through the self-
pollination of competitors’ commercial hybrids (Troyer and Mikel 2010). Maize hybrids in the
U.S. are currently produced from proprietary inbred lines protected by the U.S. Plant Variety
Protection Act (PVP), which was passed in 1970 (Mikel and Dudley 2006; Nelson et al. 2008)
and through plant patents. The act excludes others from using the protected inbred for 18 years,
or 20 years if registered after 1994. The practice of deriving inbred lines from competitors’
commercial hybrids stopped with the patenting of hybrids. Lines with expired PVP (ex-PVP) can
be used in both private and public breeding programs and for research purposes to gain insight
about the heterotic patterns used in proprietary germplasm.
2.1.3 The role of heterotic patterns in maize breeding
The breeding of field corn utilizes heterotic patterns, which make heterosis predictable
when lines from different heterotic patterns are crossed. These heterotic patterns are
typically maintained by keeping the populations of inbred lines genetically distinct from each
other (Bernardo 2001). Hybrids derived from lines from different heterotic patterns generally
perform better than hybrids created from lines from the same heterotic pattern. The molecular
basis of heterosis is not yet understood. The two main theories include “dominance”, in which
there is a complementation between parent alleles which results in the masking of deleterious
alleles and the accumulation of favourable dominant alleles, and “over dominance”, in which
heterozygous combinations of alleles are superior to homozygous combinations of alleles
(Birchler et al. 2003). While neither of these theories can fully explain this phenomenon,
heterosis has been utilized in maize breeding for nearly a century.
New inbred lines are usually generated within a heterotic group by recycling elite lines,
which maintains genetic dissimilarity between the heterotic groups but also limits diversity
within the heterotic group (Mikel 2008). To increase genetic diversity in a breeding program,
breeders occasionally use commercial hybrids to generate new inbreds, which utilizes an inter-
heterotic cross and disrupts the heterotic patterns (Bernardo 2001). The success of developing
inbreds from an inter-heterotic cross depends on the selection of a suitable tester (Bernardo
2001). The effects of using an inter-heterotic cross for inbred development are not well
understood.
2.1.4 U.S. heterotic patterns and founding inbred lines
Modern heterotic patterns of North American germplasm can be difficult to discern,
considering the privatization of modern hybrid seed production. However, most elite inbred lines
in a heterotic group can be traced back to a few key public inbred lines, so heterotic patterns can
be described based on these key founder individuals (Darrah and Zuber 1986; Nelson et al.
2008).
U.S. maize heterotic patterns fit into two overarching groups, reflecting the major
division between heterotic patterns: Iowa Stiff Stalk Synthetic (Stiff Stalk) and non-Stiff Stalk
(Duvick 2005; Mikel and Dudley 2006; Mikel 2008). Stiff Stalk lines have been largely
influenced by the public inbred B73, and can also be traced back to founder lines B14 and B37
(Mikel and Dudley 2006; Mikel 2008).
The groups detected within non-Stiff Stalk germplasm depend largely on the material
analyzed. Mikel and Dudley (2006) used U.S. PVP records from 1980 to 2004 to group 685
proprietary lines into seven main groups: Stiff Stalk, Oh43, Lancaster, Oh07-Midland, Iodent,
Commercial hybrid derived, and Argentine Maiz Amargo. Mikel (2008) used PVP records for 55
lines and identified the following non-Stiff Stalk groups: Oh43/miscellaneous, Mo17/LH123,
LH82, Oh07/Midland Yellow Dent/Pioneer Female Composite and Iodent. The Iodent group can
be traced to the founder proprietary line PH207 (Troyer 1999; Mikel and Dudley 2006). The
Lancaster heterotic pattern can be traced back to the public line Mo17 and the private line LH51,
which was derived from Mo17 (Mikel and Dudley 2006). Another non-Stiff Stalk pattern is Wf9,
“Wilson Farm row 9”, which was a popular inbred line for hybrid development for 30 years after
its development in 1936 (Troyer et al. 1999).
While heterotic patterns can be divided into Stiff Stalk and non-Stiff Stalk groups,
heterotic patterns have also been described as three main groups. Mikel and Dudley (2006)
report that the main groups in the 1950s U.S. germplasm were Stiff Stalk, non-Stiff Stalk and
Iodent. Darrah and Zuber (1986) outline the main groups in 1984 germplasm as Lancaster
(Mo17), Reid (B73, B37) and Iodent (I205 and private lines). The division of heterotic groups
and the assigning of lines to heterotic groups can be subjective and is largely dependent on the
germplasm analyzed.
2.2 USE OF GENETIC MARKERS TO INVESTIGATE RELATIONSHIPS OF MAIZE
INBREDS
2.2.1 SNP data for genetic analysis of maize relationships
While pedigree information is commonly used to determine population structure,
pedigree data is often not available or may be limited. Instead, molecular markers can be used to
analyze population structure and the relationships between inbred lines (Nelson et al. 2008).
Several types of markers have been used for genetic analysis of maize populations, including
restriction fragment length polymorphisms (RFLP) (Mumm et al. 1994), simple sequence repeats
(SSR) (Barata and Carena 2006), and single nucleotide polymorphisms (SNPs) (Nelson et al.
2011). SSR markers are multi-allelic and therefore more informative than bi-allelic SNPs.
Nelson et al. (2011) demonstrated that SNP and SSR markers can generate consistent clustering
of inbred lines when using a sufficient number of SNPs, in their case 305 SNPs compared to 150
SSR markers.
A procedure for high density SNP discovery, called Genotyping-by-Sequencing (GBS),
has been described by Elshire et al. (2011). This approach is feasible for high diversity, large
genome species such as maize, and can assess a large number of markers at low cost. This
method uses restriction enzyme digestion to reduce genome complexity by targeting lower copy
regions of the genome and avoiding repetitive regions. In this protocol, DNA is digested with the
restriction enzyme ApeKI, ligated to adapters, amplified, and then sequenced with Next-
Generation-Sequencing technology. The reads can then be aligned to the maize B73 reference
genome. Data from GBS has recently been used for genome-wide association analysis (GWAS)
and the investigation of population structure in maize (Romay et al. 2013).
2.2.2 Hierarchical clustering with maize inbreds
A common method of population structure analysis is hierarchical clustering, which has
been performed using several types of markers, beginning with RFLPs (Mumm et al. 1994), then
SSR (Barata and Carena 2006), and SNPs (Nelson et al. 2011; Wu et al. 2016). Within U.S.
germplasm, clustering with SNP data separates lines according to their heterotic pattern.
Clustering of 114 lines, including ex-PVP and public inbreds, using 639 SNP markers and the
unweighted pair group method with arithmetic mean (UPGMA) resulted in six clusters,
with the main division in the dendrogram separating the Stiff Stalk groups (B73, B37, A632)
from non-Stiff Stalk groups (Mo17, Oh43, PH207) (Nelson et al. 2008). The six predominant
clusters are described by key founder lines in each cluster. Clustering using pair-wise distances
between 21 public inbreds, including exotic material, using 351,710 SNPs resulted in three
clusters: exotic, Stiff Stalk and non-Stiff Stalk (Hansey et al. 2012).
Hierarchical clustering with SNP data has been shown to separate U.S. field germplasm
from other material. Clustering with the UPGMA method using 511 SNP markers separated 689
inbreds into clusters of sweet corn, popcorn and tropical germplasm that were distinct from the
U.S. germplasm clusters (Hansey et al. 2011). However, this clustering analysis did not produce
clear divisions of heterotic patterns within U.S. field germplasm, suggesting that clustering by
heterotic pattern may be easier with smaller sample sizes or may depend on the
U.S. germplasm used in the analysis.
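The UPGMA procedure referred to above can be illustrated with a minimal sketch. The pairwise distances below are hypothetical values chosen only so that two Stiff Stalk founders and two non-Stiff Stalk founders pair up; they are not taken from any of the cited studies, and the function name and data layout are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster items with UPGMA.

    dist maps frozenset({a, b}) -> distance between items a and b.
    Returns (tree, merges): a nested tuple and a list of (a, b, height).
    """
    # Active clusters: label -> (subtree, number of members)
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of active clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average
        for c in clusters:
            d[frozenset((new, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b), n_a + n_b)
        merges.append((a, b, height))
    ((tree, _),) = clusters.values()
    return tree, merges


# Hypothetical distances between four founder lines (illustration only)
lines = ["B73", "B37", "Mo17", "Oh43"]
toy_dist = {
    frozenset(("B73", "B37")): 0.2,
    frozenset(("Mo17", "Oh43")): 0.3,
    frozenset(("B73", "Mo17")): 0.8,
    frozenset(("B73", "Oh43")): 0.9,
    frozenset(("B37", "Mo17")): 0.7,
    frozenset(("B37", "Oh43")): 0.8,
}
tree, merges = upgma(lines, toy_dist)
print(tree)  # (('B73', 'B37'), ('Mo17', 'Oh43'))
```

Because the algorithm always merges the closest remaining pair, it returns a complete tree even when the groups are only weakly separated, so the resulting dendrogram should be interpreted alongside the merge heights.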
2.2.3 Network diagram to group related individuals
A new approach to grouping individuals, called the network diagram, has been recently
described by Romay et al. (2013). In this method, an Identity-by-State (IBS) similarity matrix
was created for 212 Ex-PVP lines using 620,279 SNPs with the software application PLINK
v1.07 (Purcell et al. 2007). Using only IBS values greater than 0.9 from the similarity matrix, an algorithm
in the software Gephi dispersed the lines in 2-dimensional space based on the simultaneous
attraction of similar points and the repulsion of dissimilar points (Bastian et al. 2009). This
method does not place any restrictions on the number of clusters that are generated in the
diagram. The network diagram created by Romay et al. (2013) generated three main clusters of
germplasm, Stiff Stalk (B73), non-Stiff Stalk (Mo17/Oh43), and Iodent (PH207), which reflect
the heterotic patterns of the analyzed germplasm.
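The input to such a network diagram can be sketched as follows. This is a simplified illustration, not the computation performed by PLINK or Gephi: genotypes are assumed to be coded as minor allele counts (0, 1 or 2, with None for missing calls), per-SNP similarity is taken as 1 - |g1 - g2| / 2, and the line names and genotypes are hypothetical. Only pairs above the 0.9 threshold are kept as edges for a layout program.

```python
def ibs_similarity(g1, g2):
    """Mean identity-by-state between two genotype vectors coded 0/1/2."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return float("nan")  # no SNP called in both lines
    return sum(1 - abs(a - b) / 2 for a, b in shared) / len(shared)


def network_edges(genotypes, threshold=0.9):
    """Return (line_a, line_b, similarity) for pairs above the threshold."""
    names = sorted(genotypes)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s = ibs_similarity(genotypes[a], genotypes[b])
            if s > threshold:
                edges.append((a, b, s))
    return edges


# Hypothetical genotypes at 10 SNPs (illustration only)
toy_genotypes = {
    "line_A": [0, 0, 2, 2, 0, 2, 0, 0, 2, 2],
    "line_B": [0, 0, 2, 2, 0, 2, 0, 0, 2, 1],  # one heterozygous difference
    "line_C": [2, 2, 0, 0, 2, 0, 2, 2, 0, 0],  # opposite allele at every SNP
}
edges = network_edges(toy_genotypes)
print(edges)  # only the line_A / line_B pair passes the 0.9 threshold
```

Lines with no similarity above the threshold have no edges and are left unconnected, which is why this approach does not impose a fixed number of clusters.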
2.2.4 Identifying parental genomic blocks in maize populations
In breeding programs, breeders aim to select, in the progeny, the genomic segments
underlying the desirable traits of the parents. High density genotypic markers are ideal for
identifying signatures of selection, genomic regions that have been subjected to selection
pressure and likely contain genes underlying biologically important traits (Cadzow et al. 2014).
When a genomic region in two individuals contains alleles that were inherited from a
common ancestor, this shared genomic segment is said to be identical-by-descent (IBD). High
density SNP data has been used to identify IBD regions in small populations of descendants from
three founder lines in Chinese maize breeding: Dan340, Huangzao4 and Mo17 (Liu et al. 2015).
The population sizes were 13, 20 and 23 inbreds for Dan340, Huangzao4 and Mo17, respectively.
In each population, the lines were genotyped for 40k SNPs, and genomic regions common
among all descendants, that also matched the founder parent, were identified as IBD regions,
using a minimum block size of 2 SNPs. Over a thousand IBD regions were detected in each
population: 1,262 in Dan340, 1,373 in Mo17, and 1,019 in Huangzao4. The longest IBD
segment detected was 4.4 Mbp and contained 25 SNPs. These identified IBD regions may
be due to selection pressure, or may be due to genetic drift in the breeding population.
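The block-calling rule described above (regions where every descendant matches the founder, with a minimum block size of 2 SNPs) can be sketched as a scan along ordered SNPs. The function name, the string coding of alleles and the toy genotypes are assumptions for illustration; the cited study may have handled missing data and heterozygous calls differently.

```python
def shared_founder_blocks(founder, descendants, min_snps=2):
    """Return (start, end) index pairs, inclusive, of runs of consecutive
    SNPs at which every descendant carries the founder allele."""
    blocks, start = [], None
    for i, allele in enumerate(founder):
        match = all(d[i] == allele for d in descendants)
        if match and start is None:
            start = i  # a new run begins
        elif not match and start is not None:
            if i - start >= min_snps:
                blocks.append((start, i - 1))
            start = None  # the run is broken
    # Close a run that extends to the last SNP
    if start is not None and len(founder) - start >= min_snps:
        blocks.append((start, len(founder) - 1))
    return blocks


# Hypothetical alleles at 10 ordered SNPs on one chromosome
founder = "AAGGTTCCAA"
descendants = ["AAGGTACCAA", "AAGCTTCCAA", "AAGGTTCCTA"]
ibd_blocks = shared_founder_blocks(founder, descendants)
print(ibd_blocks)  # [(0, 2), (6, 7)]
```

Runs of a single SNP (indices 4 and 9 here) are discarded by the minimum block size. Note that this scan detects allele sharing by state; interpreting a shared block as IBD relies on the known pedigree link to the founder.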
2.3 GRAIN YIELD IMPROVEMENTS IN MAIZE
2.3.1 Yield improvements in the 20th century
One of the main targets of maize breeding programs has been increased grain yield,
which is affected by both biomass accumulation and the partitioning of above ground
biomass to the grain (Lee and Tollenaar 2007). Selection for grain yield in U.S. commercial
populations has led to a six-fold increase in yield from 1939 (1,300 kg/ha) to 2005 (7,800 kg/ha)
as well as enhanced abiotic stress tolerance (Lee and Tollenaar 2007). While the development of
double-cross and single-cross hybrids, introduced in the 1930’s and 1960’s, respectively, has
contributed to increased corn yields, the extent of heterosis has not changed over the years,
suggesting that yield improvement since the development of hybrids can be attributed to the
improvement of the inbred lines (Troyer 1999; Duvick 2001).
The improvement of corn yields since the 1930’s has been due to both plant breeding and
improved management, with 40-50% of gains attributed to management (Duvick 2005).
Management changes include farming machinery, nitrogen fertilizer, herbicides, pest control
and higher density planting (Troyer 1999; Duvick 2005). Improvements to farming practices are
plateauing and future increases may be more dependent on genetic gain than changes in
management (Duvick 2005). The remaining 50-60% of yield gains is attributed to genetic
improvements, including improved efficiency of grain production (such as reduced branching
and weight of tassels and more upright leaves) and improved tolerance to biotic and abiotic stress
(such as heat and drought tolerance) (Duvick 2005). Breeding has also improved tolerance to
weed interference, dense planting, low soil nitrogen, low soil moisture, disease, and drought, as
well as adaptation to wider climatic regions (Troyer 1999; Duvick 2001; Tollenaar and Lee
2002). Most of the yield improvement is due to selection of genotypes that are responsive to new
management practices (Tollenaar and Lee 2002). Specifically, maize has been bred for
responsiveness to high nitrogen fertilizer applications and tolerance to high density planting
(Troyer 1999).
2.3.2 Role of planting density in yield increases
Increases in grain yield per hectare in modern maize are not due to more grain per plant
but are instead due to increased tolerance of hybrids to higher plant densities, allowing more
plants per hectare (Duvick 2005; Guo et al. 2011). The density of maize planting has increased
continually since the 1950s (Duvick 2005). In the 1930s, plant density averaged 30,000 plants/ha
and this rose to 40,000 plants/ha in the 1960s, 60,000 plants/ha in the 1980s, and 80,000
plants/ha in the mid-2000s (Duvick 2005). Abiotic stress is heightened in high density
environments because resources such as solar radiation, soil nutrients and soil moisture are
limiting (Tollenaar and Lee 2002). Tolerance to high density planting can be attributed to
increased abiotic stress tolerance, since the plants are subject to chronic abiotic stress in high
density environments (Duvick 2005).
2.3.3 Traditional QTL mapping for grain yield in maize
Yield improvement efforts now incorporate genotypic data as an essential component of
plant breeding (Bernardo and Yu 2007). With the decreasing costs of markers, genotypic data
can now play a large role in breeding through marker assisted selection (MAS) and breeding
value prediction using genome-wide selection. Marker data is especially useful for traits with
low heritability such as maize yield (Moose and Mumm 2008). Quantitative trait loci (QTLs) are
regions of the genome linked to or containing some of the genes underlying a quantitative trait,
which is influenced by many genes and environmental factors.
Traditional QTL mapping for grain yield has been performed using linkage mapping,
which utilizes a bi-parental cross to generate a mapping population. The recombination events
that occur from the bi-parental cross and subsequent inbreeding of the F2 lines create linkage
disequilibrium between the markers and QTLs in this mapping population, facilitating QTL
detection. Previous studies have identified maize grain yield QTLs using linkage mapping with
populations composed of: F3 derived from segregating F2 population (Ribaut et al. 1997), F2:3
families derived from F2 individuals (Malosetti et al. 2008), and F2:3 and F6:7 recombinant
inbred lines (RILs) from a bi-parental cross (Austin and Lee 1996). Grain yield has also been
mapped using testcross populations from backcrossed families crossed with a tester (Ho et al.
2002), F3 lines derived from segregating F2 crossed to two testers (Melchinger et al. 1998), and
F5 lines derived from F2 crossed to a tester (Boer et al. 2007). These identified QTLs provide
insight into the genetic basis of maize yield and the loci underlying yield differences between
maize lines.
2.4 IN SILICO ASSOCIATION MAPPING WITH MAIZE BREEDING PROGRAM
DATA
2.4.1 Advantages of in silico association mapping
An alternate approach to linkage mapping is association mapping, also referred to as
linkage disequilibrium mapping or genome wide association study (GWAS). This approach
utilizes historic linkage disequilibrium, arising from historical recombination events, in a
population of related individuals (Zhu et al. 2008). The use of existing phenotypic and genetic
data to detect QTLs is referred to as in silico mapping (Grupe et al. 2001). This approach uses
computational methods to identify QTLs by using the genetic variation and linkage
disequilibrium in the germplasm analyzed. This linkage disequilibrium is present in the entire
population analyzed, rather than only being present in an experimentally generated population.
In silico association mapping can be performed using elite germplasm, which integrates
QTL discovery with crop breeding. This has a number of advantages over traditional linkage
mapping: (1) As the set of materials that the QTLs are detected in is elite germplasm, the
information can be used directly for marker assisted selection in a breeding program (Parisseaux
and Bernardo 2004; Crepieux et al. 2005); (2) Phenotypes can be generated with environmental
replicates, reducing environmental effects (Zhang et al. 2005); (3) Using existing phenotypic
data has reduced cost compared to generating and assessing phenotypes for a large mapping
population (Parisseaux and Bernardo 2004); and (4) A dataset of elite germplasm has greater
potential for QTL discovery because the parents of the bi-parental cross are likely to be
monomorphic for some markers and will have limited allelic diversity compared to the
population as a whole (Parisseaux and Bernardo 2004; van Eeuwijk et al. 2010; Bink et al.
2012).
2.4.2 In silico QTL mapping with maize breeding program data
This in silico mapping approach can either be applied to an association mapping panel or
to hybrid data from a maize breeding program. An association mapping panel is a collection of
elite inbred maize lines that are genotyped and phenotyped to use in GWAS. In silico QTL
mapping using inbred lines from maize breeding programs has been demonstrated by Romay et
al. (2013) and Zhang et al. (2005) for traits related to flowering time. Alternatively, mapping can
use phenotypic data for maize hybrids, generated in a breeding program, and genotypic data for
parental inbred lines. This approach detects QTLs in the parental inbred lines, with QTLs
detected in each heterotic pattern separately. This approach, as compared to an association
mapping panel, has several advantages: (1) The use of hybrid data can capture the heterotic
phenotype of traits such as grain yield and plant height, which cannot be achieved using
phenotypic data of parental lines; (2) This approach allows for detection of QTLs unique to a
heterotic pattern, as well as allows for cross-validation of QTLs between heterotic patterns
(Parisseaux and Bernardo 2004); and (3) Hybrid trials in breeding programs are conducted in
environments to which the hybrid is adapted, ensuring that detected QTLs are relevant to
environments used for crop production.
Previous studies have reported in silico QTL mapping using proprietary maize breeding
program hybrid data. Parisseaux and Bernardo (2004) detected QTLs for grain moisture, plant
height and smut resistance using 96 SSR markers and 22,774 hybrids from the Limagrain
genetics program (France). The hybrids were generated from 1,266 inbred lines using nine
combinations of the nine heterotic groups. van Eeuwijk et al. (2010) detected QTLs for ear
height, plant height and yield using 769 SNPs and hybrid phenotypic data from Pioneer Hi-Bred
International (Johnston, IA). The germplasm included 1,700 hybrids generated by crossing lines
from two heterotic groups. These studies generated hybrids by crossing individuals from
different heterotic pools in various combinations, rather than systematically generating hybrids.
2.4.3 Mixed models for in silico mapping
In silico mapping utilizes a mixed model that is fit to the data and then used to identify
loci significantly associated with the phenotype of interest. The general mixed model is as
follows: phenotypic observations = fixed effects (for environment and replicates) + additive
effects associated with marker alleles in heterotic group 1 parents + additive effects associated
with marker alleles in heterotic group 2 parents + additive effects not associated with markers of
heterotic group 1 parents (polygenic effects) + polygenic effects for heterotic group 2 parents +
residual effects (error). The two papers describing in silico mapping with hybrid phenotypic data
used different mapping approaches: association mapping (Parisseaux and Bernardo 2004) and
interval mapping (van Eeuwijk et al. 2010). The association mapping approach tests each marker
for linkage disequilibrium with a QTL. In the mixed model, the markers were treated as fixed
effects and the model parameters were estimated using the Restricted Maximum Likelihood
(REML) approach. The differences of marker allele effects were computed and tested for
significance using z-tests, for each of the heterotic patterns in the nine crosses used, for a total of
18 opportunities to detect a QTL. A marker locus was considered significant if it had at least
one significant z-test.
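The general mixed model described in words above can be written compactly in matrix notation. The symbols are assumed here for illustration and are not taken from the cited papers: y is the vector of hybrid phenotypic observations, X\beta the fixed effects for environment and replicates, M_k\alpha_k the additive effects associated with marker alleles in the parents from heterotic group k (k = 1, 2), Z_k u_k the polygenic effects of those parents, and e the residual. The second expression is a generic form of the z-test on the difference between the estimated effects of two alleles, i and j, at a marker within heterotic group k.

```latex
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta}
  + \mathbf{M}_1\boldsymbol{\alpha}_1 + \mathbf{M}_2\boldsymbol{\alpha}_2
  + \mathbf{Z}_1\mathbf{u}_1 + \mathbf{Z}_2\mathbf{u}_2
  + \mathbf{e}
\]

\[
z = \frac{\hat{\alpha}_{k,i} - \hat{\alpha}_{k,j}}
         {\operatorname{SE}\left(\hat{\alpha}_{k,i} - \hat{\alpha}_{k,j}\right)}
\]
```

In the association mapping approach the marker effects \alpha_k are treated as fixed, while the polygenic effects u_k and the residual e are random terms whose variances are estimated by REML.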
The interval mapping approach, first described by animal breeders (George et al. 2000),
was used by van Eeuwijk et al. (2010). This approach tests for QTLs at positions between
markers, at chosen intervals along the genome. For each interval, a genomic relationship matrix
was calculated using marker and pedigree data. In the mixed model, the marker effect was
treated as random and the variance/covariance structure was defined by the genomic relationship
matrix. The effects of the parameters in the model were then estimated using REML. For each
interval position, the likelihood of the model with a QTL effect was then compared to the
likelihood of a model with no QTL effect using the Likelihood Ratio Test statistic.
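The comparison of the two likelihoods can be sketched as below. The log-likelihood values are hypothetical, and the p-value assumes a chi-square reference distribution with one degree of freedom, which is a simplifying assumption; the appropriate reference distribution for a variance component may differ in practice.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_qtl):
    """Return (LRT statistic, p-value) for a model with a QTL effect
    against a model without one, assuming chi-square with 1 df."""
    lrt = max(2.0 * (loglik_qtl - loglik_null), 0.0)
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(lrt / 2.0))
    return lrt, p_value


# Hypothetical REML log-likelihoods at one interval position
lrt, p_value = likelihood_ratio_test(loglik_null=-1052.3, loglik_qtl=-1047.1)
print(round(lrt, 1))  # 10.4
```

Repeating this test at each interval position gives a profile of the test statistic along the genome, with peaks indicating putative QTL positions.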
Association mapping is simpler and less computationally demanding than interval
mapping (Tanksley 1993). However, interval mapping is superior to association mapping for
estimating QTL locations and effects when there is low marker coverage of the genome, since
association mapping requires linkage disequilibrium between the potential QTL and a marker
(Tanksley 1993). The power of the association mapping approach has been assessed using
simulations of maize hybrid data (Yu et al. 2005). These simulations showed that higher power for
QTL detection is achieved with a large sample size (2,400 hybrids vs 600), high marker density
(400 vs 200 markers), high heritability (0.7 vs 0.4) and a small number of QTLs underlying the
trait (20 vs 80). With dense marker coverage (markers < 15 cM apart), results are comparable
between interval and association mapping (Stuber et al. 1992; Tanksley 1993; Zhu et al. 2008).
With GBS data, the number of markers that can be generated is much larger than the number
used in these previous studies (Elshire et al. 2011). The dense marker coverage of GBS data has
already been used for in silico GWAS mapping of flowering time in maize (Romay et al. 2013).
This dense marker data has not yet been applied to hybrid data from a maize breeding program.
2.5 QTL MAPPING WITH NORTH CAROLINA DESIGN II DATA
2.5.1 Description of North Carolina Design II
North Carolina Design II (NCII) is a mating design commonly used in maize breeding,
and can be used with any crop making use of heterosis (Comstock and Robinson 1952). This
mating design is used to assess superior parental inbreds through general combining ability and
superior parental combinations through specific combining ability. The NCII, also called a
factorial mating design, was originally developed for livestock, but it has become routinely used
in maize. The NCII is systematic: each member of a group of females is crossed with
each member of a group of males. When used in maize, the group of females belongs to one
heterotic pattern and the group of males belongs to another heterotic pattern. This design can be
used to partition the genotypic variance into effects due to the males, females and their
interaction using a two-way ANOVA (Hallauer et al. 2010a). The genetic effects due to males
and females represent additive genetic effects, while the genetic effects due to the interaction
represent non-additive genetic effects (Rojas and Sprague 1952).
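The partitioning described above can be sketched for a balanced factorial. This is a minimal illustration that assumes every female by male cross has the same number of replicate observations and ignores environment and block terms, which a real trial analysis would include; the line names and yields are hypothetical.

```python
from statistics import mean


def ncii_anova(yields):
    """Two-way ANOVA sums of squares for a balanced NCII.

    yields[f][m] is a list of r replicate observations for female f x male m.
    Returns {source: (sum of squares, degrees of freedom)}.
    """
    females = sorted(yields)
    males = sorted(yields[females[0]])
    nf, nm = len(females), len(males)
    r = len(yields[females[0]][males[0]])

    grand = mean(y for f in females for m in males for y in yields[f][m])
    f_mean = {f: mean(y for m in males for y in yields[f][m]) for f in females}
    m_mean = {m: mean(y for f in females for y in yields[f][m]) for m in males}
    cell = {(f, m): mean(yields[f][m]) for f in females for m in males}

    ss_f = nm * r * sum((f_mean[f] - grand) ** 2 for f in females)
    ss_m = nf * r * sum((m_mean[m] - grand) ** 2 for m in males)
    ss_fm = r * sum(
        (cell[f, m] - f_mean[f] - m_mean[m] + grand) ** 2
        for f in females for m in males
    )
    ss_e = sum(
        (y - cell[f, m]) ** 2
        for f in females for m in males for y in yields[f][m]
    )
    return {
        "females": (ss_f, nf - 1),
        "males": (ss_m, nm - 1),
        "females x males": (ss_fm, (nf - 1) * (nm - 1)),
        "error": (ss_e, nf * nm * (r - 1)),
    }


# Hypothetical yields (t/ha): 2 females x 2 males x 2 replicates
toy_yields = {
    "F1": {"M1": [12.5, 13.5], "M2": [8.5, 9.5]},
    "F2": {"M1": [10.5, 11.5], "M2": [6.5, 7.5]},
}
anova = ncii_anova(toy_yields)
for source, (ss, df) in anova.items():
    print(source, ss, df)
```

In this toy data the cross means are exactly additive, so the interaction sum of squares is zero and all genetic variation is attributed to the female and male main effects.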
2.5.2 QTL mapping with NCII breeding program data
The NCII is an ideal mating design for the integration of QTL mapping with existing
maize hybrid data. This mating design is popular in maize and is an efficient method to assess
whether there are sufficient additive genetic effects in each heterotic group to warrant a QTL
mapping experiment with the data. While mapping with NCII data has not yet been described, an
association mapping study did detect QTLs for seed oil content in rapeseed using a partial NCII
design with 205 SSR markers assessed in the parental lines (Bu et al. 2015). This mixed model
included only marker data, and did not include terms for the relationship between lines or
environmental conditions. Further research is needed to develop methods of integrating QTL
mapping with NCII data from plant breeding programs, utilizing high density markers.
2.6 RESEARCH OBJECTIVES AND HYPOTHESES
This thesis project utilized data from several sources. Firstly, 16 Ex-PVP lines and 126
inbred lines from the Guelph breeding program were assessed at 955,690 SNPs through GBS by
the Genomic Diversity Institute at Cornell. These lines belong to Stiff Stalk, Iodent and
Lancaster heterotic patterns. Secondly, publically available GBS data for 17,280 lines, as well as
the pedigree information for these lines, were obtained from panzea.org. This dataset includes
U.S. public germplasm, U.S. ex-PVP lines, tropical germplasm, popcorn, and sweet corn. These
two datasets were combined into one GBS dataset. Thirdly, the QTL mapping component of this
research utilized phenotypic data for grain yield from a NCII experiment at the University of
Guelph. This NCII crossed 11 Stiff Stalk lines to 10 Iodent lines. These hybrids were assessed
for grain yield at three locations near Guelph over three years. The parental lines used in the
NCII are included in the GBS dataset described above.
Objective 1: To investigate population structure using a network diagram and hierarchical
clustering.
o Hypothesis 1. The network diagram groups lines into three main nodes, reflecting three
main heterotic patterns: Stiff Stalk, Lancaster/non-Stiff Stalk and Iodent.
o Hypothesis 2. Hierarchical clustering separates U.S. Ex-PVP field germplasm from
tropical, popcorn and sweet corn germplasm, but does not generate clearly defined
heterotic groups within field corn.
Objective 2: To identify Ex-PVP relatives of Guelph lines using hierarchical clustering.
o Hypothesis 1. Guelph lines cluster with the Ex-PVP line from the public dataset with
greatest genetic similarity. After removing SNPs with identical alleles between the
Guelph line and the first parent/ relative, the Guelph line then clusters with the second
parent/ next closest relative.
o Hypothesis 2. Of the clustering neighbours, an inbred line will share the greatest
percentage of SNP alleles with a parent, over other relatives.
o Hypothesis 3. If the correct parents are identified, 100% of the alleles in an inbred will be
accounted for by the parents.
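The logic behind these three hypotheses can be illustrated with a simplified sketch. Instead of hierarchical clustering, it ranks candidate lines directly by the fraction of shared SNP alleles, which is a substitution made here to keep the example short. The function names, line names and genotypes are hypothetical, and fully homozygous inbreds with no missing data are assumed.

```python
def shared_fraction(line, other, positions=None):
    """Fraction of SNPs (optionally restricted to positions) with identical alleles."""
    idx = list(range(len(line))) if positions is None else list(positions)
    if not idx:
        return 0.0
    return sum(line[i] == other[i] for i in idx) / len(idx)


def putative_parents(line, candidates):
    """Return (first parent, second parent, fraction of alleles accounted for)."""
    # Step 1: the candidate sharing the most alleles overall
    first = max(candidates, key=lambda n: shared_fraction(line, candidates[n]))
    # Step 2: keep only SNPs not explained by the first parent
    remaining = [i for i in range(len(line)) if line[i] != candidates[first][i]]
    if not remaining:
        return first, None, 1.0
    others = {n: g for n, g in candidates.items() if n != first}
    second = max(others, key=lambda n: shared_fraction(line, others[n], remaining))
    # Step 3: alleles matched by at least one of the two parents
    covered = sum(
        line[i] in (candidates[first][i], candidates[second][i])
        for i in range(len(line))
    ) / len(line)
    return first, second, covered


# Hypothetical inbred genotypes at 10 SNPs (illustration only)
progeny = "AAGGCCTTAA"
toy_candidates = {
    "P1": "AAGGCCTTGG",  # matches the progeny at 8 of 10 SNPs
    "P2": "TTCCGGAAAA",  # matches only at the 2 SNPs P1 does not explain
    "P3": "AAGGTTAAGG",  # a relative of P1, matches at 4 of 10 SNPs
}
parents = putative_parents(progeny, toy_candidates)
print(parents)  # ('P1', 'P2', 1.0)
```

The example shows why the first parent's SNPs are removed before the second search: P3 shares more alleles with the progeny overall than P2 does, but none at the SNPs left unexplained by P1.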
Objective 3: To identify parental alleles in a group of siblings derived from an inter-
heterotic cross.
o Hypothesis 1. Progeny lines have an overrepresentation of alleles from the parent with
the same behaviour in crosses (heterotic pattern) as the progeny.
o Hypothesis 2. There are large parental linkage blocks common to all siblings.
Objective 4: To identify QTLs for additive effects for grain yield using a mixed model
approach with phenotypic data from NCII.
o Hypothesis 1. A genomic relationship matrix, generated from GBS data, can be used to
describe the variance/covariance structure of the polygenic term in the model, and
substitute for the traditional pedigree-based matrix.
o Hypothesis 2. Based on the density x additive genetic variance present in Stiff Stalk,
determined from NCII ANOVA results, it is expected that QTLs will be detected in the
Stiff Stalk background.
o Hypothesis 3. Based on the absence of additive genetic variance in Iodent germplasm,
determined from NCII ANOVA results, it is expected that no QTLs will be detected in
the Iodent background.
o Hypothesis 4. QTLs for grain yield are numerous, with small effects.
CHAPTER 3: APPLICATIONS OF GENOTYPING-BY-SEQUENCING DATA IN
MAIZE BREEDING PROGRAMS
3.1 ABSTRACT
As genotyping-by-sequencing (GBS) data becomes more abundant, new applications of
this technology in breeding programs are possible. While GBS data is typically used for genomic
selection and QTL mapping, GBS information can also be used to establish relationships
between lines and examine inheritance of parental linkage blocks. This study presents methods to
utilize GBS data in this manner using publically available data to: (1) examine heterotic patterns
in the germplasm; (2) identify putative parents for lines with unknown parentage; and (3)
examine the genomic structure of lines derived from breeding crosses. Marker data from nearly
1,200 inbred lines, including public, ex-PVP, and University of Guelph germplasm was used to
generate a genetic similarity matrix. Using the genetic similarity matrix, a network diagram and
hierarchical cluster diagram were generated, which reflected the heterotic patterns of this
germplasm. Hierarchical clustering was used to identify putative parents of Guelph lines from a
publically available dataset, as demonstrated with examples of known family structures. In two
sets of siblings derived from breeding crosses, parental linkage blocks were determined and
visualized across the chromosomes. These methods allow breeders to maximize the use of their
marker data by analyzing the relationships between lines in their breeding program.
3.2 INTRODUCTION
Maize breeding is divided into two different activities: inbred line development and
hybrid commercialization (Duvick and Cassman 1999; Lee and Tollenaar 2007). Inbred line
development, particularly in the commercial sector, now routinely uses molecular marker data to
predict the genetic value of plants through what is called genomic selection (GS) (Meuwissen et
al. 2001; Bernardo and Yu 2007; Piepho 2009; Crossa et al. 2010). Recently, genotyping-by-
sequencing (GBS) methodologies have been developed to assess a large number of SNP markers
at low cost per sample in large genome species such as maize (Elshire et al. 2011). Currently,
GBS data for 17,289 maize lines, assessed at 955,690 markers, is publically available at
panzea.org. Many applications of GBS data have been described, including GS (Crossa et al.
2013; Zhang et al. 2015), QTL mapping (Romay et al. 2013; Li et al. 2015a; Li et al. 2015b;
Zhou et al. 2016), prediction of deleterious mutations (Mezmouk and Ross-Ibarra 2014), and
analysis of population structure (Romay et al. 2013; Wu et al. 2016). However, the potential of
this data for understanding the structure of the commercial maize germplasm pool, the
relationships between maize lines, and the inheritance of linkage blocks has not received much
attention.
Maize germplasm is complex, including varieties of popcorn, sweet corn, and field corn,
grown in diverse regions around the world. Over the last 100 years, field corn germplasm in the
U.S. has shifted from open pollinated cultivars to hybrids. In the early 1900s, it is estimated that
U.S. field corn consisted of 1000 open pollinated cultivars (Montgomery 1916). Genetic
improvement was achieved through recurrent selection of ears from female plants with desirable
traits. The concept of hybrid maize was then developed by Shull (1908, 1909) and East (1909),
with the concept of double-cross hybrids contributed by Jones (1918). Double-cross hybrids
began replacing open pollinated cultivars in the 1930s, and these were then replaced by single
cross hybrids, beginning in the 1960s (Crow 1998). The initial inbred lines were derived from
the self-pollination of virtually all of the open pollinated varieties (Troyer 1999). Private
companies began their breeding programs by utilizing the same public germplasm in the 1920s
and 1930s, and subsequently developing new inbred lines by self-pollinating superior,
commercial hybrids from their private breeding programs (Troyer 1999). Over time, germplasm
has been exchanged between breeding programs through the self-pollination of competitors’
commercial hybrids (Troyer and Mikel 2010). Since the 1970s, breeding has become
increasingly privatized and the role of public inbred lines in commercial hybrids has continually
declined (Darrah and Zuber 1986; Mikel and Dudley 2006). The practice of deriving inbred lines
from competitors’ commercial hybrids stopped with the patenting of hybrids. Maize hybrids in
the U.S. are currently produced from proprietary inbred lines protected by the U.S. Plant Variety
Protection Act (PVP), which was passed in 1970 (Mikel and Dudley 2006; Nelson et al. 2008)
and through plant patents. Lines with expired PVP (ex-PVP) can then be used in public breeding
programs and for research purposes to gain insight about proprietary germplasm.
The breeding of field corn utilizes heterotic patterns, which make heterosis
predictable when lines from different heterotic patterns are crossed. These heterotic patterns are
typically maintained by keeping the populations of inbred lines genetically distinct from each
other (Bernardo 2001). New inbred lines are typically generated within a heterotic group by
recycling elite inbred lines, which maintains genetic dissimilarity between the heterotic groups
but also limits diversity within the heterotic group (Mikel 2008). Since most elite inbred lines in
a heterotic group are related to a few founder individuals, heterotic patterns can be described
based on these key founder individuals (Darrah and Zuber 1986; Nelson et al. 2008). The U.S.
maize heterotic patterns fit into two overarching groups, reflecting the major division between
heterotic patterns: Iowa Stiff Stalk Synthetic (Stiff Stalk) and non-Stiff Stalk (Duvick 2005;
Mikel and Dudley 2006; Mikel 2008). However, U.S. germplasm has also been described as
three main groups: Stiff Stalk (with founder lines B73, B14, B37), non-Stiff Stalk/ Lancaster
(Mo17, Oh43), and Iodent (PH207) (Darrah and Zuber 1986; Mikel and Dudley 2006).
Molecular markers can be used to understand the structure of commercial germplasm
pools and the relationships between maize lines. A common method of population structure
analysis is hierarchical clustering, which has been performed using several types of markers,
beginning with RFLPs (Mumm et al. 1994), then using SSR (Barata and Carena 2006), and SNPs
(Nelson et al. 2011; Wu et al. 2016). Hierarchical clustering with SNP data has been used to
separate U.S. field germplasm from tropical, sweet corn and popcorn germplasm (Hansey et al.
2011). Clustering with SNP data also divides ex-PVP and public U.S. field corn germplasm into
two main subgroups, Stiff Stalk and non-Stiff Stalk lines (Nelson et al. 2008; Hansey et al.
2012). A dendrogram of 92 ex-PVP lines, created using 614 SNPs, resulted in six predominant
clusters, described by the key founder line in each cluster: B73, Mo17, PH207, A632, Oh43, and
B37 (Nelson et al. 2008).
While hierarchical clustering will always calculate clusters regardless of the strength of
the connection, there are other methods of analyzing population structure that reflect the degree
of similarity or dissimilarity between individuals. For example, principal component analysis
(PCA) uses a similarity matrix to identify principal components, to describe the variability of the
data. For 92 ex-PVP lines, the first two principal components from PCA created a tetrahedral
cloud with vertices of B73, Mo17 and PH207, reflecting the key founder lines in the germplasm
(Nelson et al. 2008). For 544 tropical lines, PCA indicated three clearly distinguished major
subgroups, which were consistent with the environmental adaptations of the germplasm and
CIMMYT breeding records (Wu et al. 2016). Another method, termed principal coordinate
analysis (PCOA), utilizes a dissimilarity matrix and reflects the distance between pairs of points.
The analysis of 2,815 maize inbred lines by PCOA, using GBS data, revealed groups of known
maize subpopulations, including tropical, sweet corn, popcorn, Stiff Stalk, and non-Stiff Stalk
lines (Romay et al. 2013). A new, recently described approach is the network diagram, which
uses a similarity matrix as input. An algorithm, performed by the software Gephi, disperses the
lines in 2-dimensional space based on the simultaneous attraction of similar points and the
repulsion of dissimilar points (Bastian et al. 2009). This method has been applied to 212 ex-PVP
lines, which generated three main clusters of germplasm, Stiff Stalk, non-Stiff Stalk, and Iodent,
which reflect the heterotic patterns of the analyzed germplasm (Romay et al. 2013).
The application of marker data to investigate the inheritance of linkage blocks in
breeding populations has not received much attention. High density genotypic markers are ideal
for identifying signatures of selection, which are genomic regions that have been subjected to
selection pressure and likely contain genes underlying biologically important traits (Cadzow et
al. 2014). When a genomic region in two individuals contains alleles that were inherited from
a common ancestor, this shared genomic segment is said to be identical-by-descent (IBD). High
density SNP data has been used to identify IBD regions in small populations of descendants from
three founder lines in Chinese maize breeding: Dan340, Huangzao4 and Mo17 (Liu et al. 2015).
In each population, the lines were genotyped for 40k SNPs, and genomic regions common
among all descendants, that also matched the founder parent, were identified as IBD regions,
with over a thousand IBD regions detected in each population. This method has not yet been
applied to temperate maize populations.
This paper examines several uses of GBS data in maize breeding programs. Specifically,
the utility of hierarchical clustering and a network diagram for assigning lines to heterotic
patterns is examined. Then, a strategy for using hierarchical clustering with GBS data to
establish parentage between lines in the absence of well-defined pedigree information is
illustrated. And finally, this work demonstrates how GBS data can be used to examine the
genomic structure of progeny from various breeding crosses.
3.3 METHODS
3.3.1 Germplasm and marker data
The germplasm selected for sequencing included 126 inbred lines from the University of
Guelph maize breeding program, belonging primarily to the Stiff Stalk, Iodent and Lancaster
heterotic patterns, and 16 ex-PVP lines (Table 3.1). Leaf discs from several plants were bulked
and DNA was extracted (University of Wisconsin, Genomes to Field). GBS data was generated
for the 142 inbred lines by the Genomic Diversity Facility at Cornell University using the
method described by Elshire et al. (2011), hereafter referred to as the Guelph GBS data. The Guelph
GBS data was partially imputed and assessed at 955,690 SNPs. Publicly available GBS data for
17,280 inbred lines was also obtained from panzea.org (file AllZeaGBSv2.7, posted December
18, 2013).
Table 3.1. Pedigrees and families of Guelph inbred lines and Ex-PVP lines used to generate GBS
data. Pedigree information from E. Lee (unpublished data) and U.S. plant variety protection
certificates.
Guelph lines  Heterotic Pattern  Family              Pedigree
CG1           -                  -                   Funks G10
CG2           -                  -                   Funks G10
CG3           -                  -                   Funks G10
CG4           -                  -                   Funks G10
CG5           -                  B73                 Funks G10
CG6           -                  -                   Open-pollinated flint variety from the sub-alpine area of Europe
CG7           -                  -                   from second cycle lines developed in England for cold tolerance
CG8           -                  -                   Open-pollinated flint variety from the sub-alpine area of Europe
CG9           -                  -                   from second cycle lines developed in England for cold tolerance
CG10          -                  -                   from second cycle lines developed in England for cold tolerance
CG13          -                  -                   Golden Glow
CG14          -                  -                   Wigor
CG15          -                  -                   Wigor
CG16          Misc               Wf9/Mn13            (CG7 x CM37)
CG17          Misc               Wf9/Mn13            (CG7 x CM37)
CG18          Misc               Wf9/Mn13            (CG7 x CM37)
CG19          Misc               Wf9/Mn13            (CG7 x CM37)
CG20          Misc               Wf9/Mn13            (CG7 x CM37)
CG21          Misc               Wf9/Mn13            CH593-9 BC2 NRP GL5
CG22          Misc               Wf9/Mn13            CH593-9 BC2 NRP GL5
CG23          -                  -                   CH591-23 BC2 NRP A498
CG24          Misc               Wf9/Mn13            (CG19 x CG17) x Pioneer3950
CG25          -                  -                   CH593-9 BC3 NRP A498
CG26          -                  -                   F2 x F7
CG27          Stiff Stalks       B73                 B73 BC1 NRP early single-cross
CG28          -                  -                   CG Synthetic A (S) C0
CG29          -                  -                   CG Synthetic B (S) C0
CG30          -                  -                   CG Synthetic B (S) C0
CG31          -                  -                   CG HOPE A
CG32          -                  -                   CG HOPE A
CG33          Misc               Wf9/Mn13            (CG19 x CG17) x Pioneer3950
CG34          -                  -                   (CG19 x CG17) x Pioneer3950
CG37          -                  -                   Pioneer3803
CG38          -                  -                   Pioneer3803
CG40          -                  -                   Pioneer3902
CG41          Iodent             -                   Pioneer3902
CG42          Stiff Stalks       B73                 -
CG44          Iodent             207                 Pioneer3790
CG50          -                  -                   France-Canada Gene Pool
CG52          -                  -                   Morden Stiff Stalk Gene Pool
CG53          Misc               Wf9/Mn13            Stalk Quality III
CG54          Misc               Wf9/Mn13            Stalk Quality VII (Stress pop.)
CG55          Iodent             -                   Pioneer3881
CG56          Iodent             -                   Pioneer3925
CG57          Stiff Stalks       unrSS               Pioneer3803
CG58          Iodent             -                   Pioneer3906
CG60          Iodent             207                 Pioneer3902
CG61          Iodent             -                   Pioneer3902
CG62          -                  -                   Pioneer3902
CG63          Iodent             -                   Pioneer3790
CG64          Iodent             -                   Pioneer3790
CG65          Stiff Stalks       unrSS               Pioneer3803
CG66          Iodent             -                   Pioneer3901
CG67          Iodent             -                   Pioneer3901
CG68          Iodent             -                   Pioneer3707
CG69          Misc               Wf9/Mn13            Pioneer3929
CG70          Misc               Wf9/Mn13            Pioneer3929
CG71          Misc               Wf9/Mn13            Pioneer3929
CG73          Iodent             -                   Stalk Quality VII (Stress pop.)
CG79          Stiff Stalks       unrSS               Pioneer3803
CG80          -                  -                   Pioneer3737
CG84          Misc               Wf9/Mn13            Pioneer3929
CG85          Iodent             -                   Pioneer3790
CG86          Iodent             -                   Pioneer3790
CG88          -                  -                   CG Lancaster (RRS)
CG89          Misc               Wf9/Mn13            S2 line from CG CBI x Pioneer3929
CG90          -                  -                   S2 line from CGSynA x Pioneer3902
CG91          -                  -                   S2 line from CG Hope IA x Pioneer3921
CG92          -                  -                   CG23 x Pioneer3921
CG93          Misc               Wf9/Mn13            S5 line from Pioneer3969 x Pioneer3929
CG94          Misc               Wf9/Mn13            S2 line from CG Lancaster (Bio) x Pioneer3921
CG95          Misc               Wf9/Mn13            S2 line from FCGP x Pioneer3929
CG97          Iodent             -                   S2 line from CG Wigor x Pioneer3921
CG98          Misc               Wf9/Mn13            S2 line from SQII x Pioneer3929
CG99          Misc               Wf9/Mn13            S2 line from CG SS (BIO) x Pioneer3921
CG100         Misc               Wf9/Mn13            S2 line from SQI x Pioneer3929
CG101         -                  -                   S2 line from CG Lancaster (BIO) x Pioneer3921
CG102         Stiff Stalks       B14                 CG Stiff Stalk combined C2
CG104         Iodent             207                 Pioneer3902
CG105         Iodent             207                 Pioneer3902
CG106         Misc               Wf9/Mn13            (CG33 x CG34) x (BSL (S4) C7 x CGSyn A-NL) (S4)
CG108         Iodent             207                 Pioneer3902
CG110         Lancaster          unknown             CCGP A C2S2 x NK2555
CG111         Lancaster          unknown             CCGP B C2S2 x NK2555
CG112         -                  -                   (CG SynA C7 x Pioneer 3475)
CG113         -                  -                   (CG SynA C7 x Pioneer 3475)
CG114         -                  -                   (CG102 x G-4193)
CG115         -                  -                   CG CBI C3 x Pioneer 3876
CG118         Stiff Stalks       B14                 (CG65 x CG33) x CG102
CG119         Stiff Stalks       B14                 (CG65 x CG33) x CG102
CG120         Stiff Stalks       B14                 (CG65 x CG33) x CG102
CG121         Stiff Stalks       B14                 CG60 x CG102 Intermated pop.
CG122         Iodent             207                 CG60 x CG102 SSD pop.
CG123         Lancaster          Oh43-Iodent         CGR01 x CG110
CG124         Lancaster          Oh43                (CG103/CO422 x B100/CG88) x LH38
CG125         Iodent             207                 (CG108 x Caribbean Flint) x CG108
CG126         Lancaster          unknown             CG111 x (CG111 x Mexican Dent)
CG127         Iodent             207                 (CG62 x Mexican Dent) x CG62
CGR01         Lancaster          unknown             (B93 x NK2555) x NK2555
CGR02         -                  -                   (CG44x(MF4xP1247)-3)-B-2-1-1-1
CGR03         Lancaster          Oh43-Iodent         (B100 x NK2555) x NK2555
CGX333        Stiff Stalks       B73                 SD79 x SD80, white
HiC1          Stiff Stalks       B14                 CH04030 x CG102-2(G)-1-1-1
HiC3          Stiff Stalks       B14                 UR13088 x CG102-6(G)-1-4-1
HiC4          Stiff Stalks       B14                 UR13088 x CG102-3(G)-1-3-1
HiC5          Stiff Stalks       B14                 UR13088 x CG102-3(G)-1-4-1
HiC6          Stiff Stalks       B14                 UR01089 x CG102-2(G)-1-1-1
HiC8          Lancaster          unknown             CH05015 x CG33-1(R)-1-3-1
HiC9          Lancaster          unknown             CH05015 x CG33-1(R)-1-4-1
HiC11         Iodent             unknown             AR13026 x CG60/CG62-4(G)-1-1-1
HiC12         Iodent             unknown             AR13026 x CG60/CG62-4(G)-1-3-1
HiC13         Iodent             unknown             AR13026 x CG60/CG62-1(G)-1-1-1
HiC14         Iodent             unknown             AR13026 x CG60/CG62-1(G)-1-3-1
HiC17         Iodent             unknown             AR13026 x CG60/CG62-1(R)-1-2-1
HiC21         Iodent             unknown             AR13026 x CG60/CG62-5(R)-1-1-1
HiC22         Iodent             unknown             AR13026 x CG60/CG62-5(R)-1-6-1
HiC23         Iodent             unknown             AR13026 x CG60/CG62-5(R)-1-11-1
HiC24         Lancaster          unknown             AR13026 x CG33-3(G)-1-3-1
HiC25         Lancaster          unknown             AR13026 x CG33-3(G)-1-5-1
HiC26         Lancaster          unknown             AR13026 x CG33-3(G)-1-6-1
HiC27         Stiff Stalks       B14                 AR13026 x CG102-18(R)-1-3-1
HiC28         Stiff Stalks       B14                 AR13026 x CG102-18(R)-1-4-1
HiC29         Stiff Stalks       B14                 AR13026 x CG102-18(R)-1-8-1
HiC30         Stiff Stalks       B14                 AR13026 x CG102-22(R)-1-6-1
HiC32         Stiff Stalks       B14                 AR13026 x CG102-24(R)-1-3-1
HiC33         Stiff Stalks       B14                 AR13026 x CG102-24(R)-1-5-1
Ex-PVP lines  Heterotic Pattern  Family              Pedigree
(DK)2MCDB     Lancaster          unknown             #1 2MA22 x 4780 composite
(DK)8M129     Lancaster          BS11                78060A x 88144
(AS)5707      Lancaster          C103                -
LH159         Lancaster          BS11                Pioneer3160
LH210         Lancaster          BS11                LH51 x BS11LH C3
LH216         Lancaster          Mo17                (LH51 x LH123) x LH51
PHBW8         Stiff Stalks       unrSS               PHJ40 x PHW52
PHEG9         Stiff Stalks       B84                 PHG86 x PHW52
PHGG7         Lancaster          Wf9                 PHT64 x PHG49
PHHV4         Stiff Stalks       B84                 PHG69 x PHM44
PHK56         Lancaster          Oh7-Midland/Iodent  PHG47 x PHG35
PHKE6         Iodent             KE6                 PHG29 x PHG47
PHRE1         Stiff Stalks       unrSS               PHJ40 x PHR47
PHVJ4         Iodent             VJ4                 PHJ40 x 207
PHW53         Iodent             207                 G50 x PHZ51
PHW80         Lancaster          C103                PHK76 x PHN37
3.3.2 Network Diagram
A subset of the data consisting of 1,191 lines was used to generate the network diagram.
The selected lines included 110 Guelph inbred lines and 1,081 lines from the Panzea data set,
which were deposited by Romay et al. (2013). The 1,081 lines selected are U.S. ex-PVP and
public field corn from the Stiff Stalk, Iodent and non-Stiff Stalk heterotic patterns. Markers were
filtered using TASSEL (Bradbury et al. 2007), for minor allele frequency (maf) ≥ 0.05 and
maximum 90% missing data. The 353,994 SNPs passing filtering criteria were re-coded to
additive components, indicating the number of minor alleles at each locus (i.e. 0, 1, 2) using
PLINK v1.9 (Purcell et al. 2007). An identity-by-state (IBS) matrix was calculated using the
method of VanRaden (2008). The network diagram was created with the IBS matrix
using the network visualization platform Gephi v.0.8.2 (Bastian et al. 2009). The force-directed
layout Force Atlas was used. The connection between nodes was filtered based on IBS values,
and different levels of this filtering parameter, from IBS > 0.1 to IBS > 0.9, were assessed. The
value of the filtering parameter that generated the best separation of clusters was selected, which
was an IBS value > 0.3.
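The filtering, matrix and edge-selection steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the TASSEL, PLINK or Gephi workflow itself: genotypes are assumed to be coded as the number of minor alleles (0, 1, 2, with None for missing data), the matrix is assumed to follow the first method of VanRaden (2008), G = ZZ'/(2Σp(1−p)), and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

# geno[line][marker]: 0, 1 or 2 copies of the counted allele; None = missing
Genotypes = List[List[Optional[int]]]


def filter_markers(geno: Genotypes, min_maf: float = 0.05,
                   max_missing: float = 0.90) -> List[int]:
    """Indices of markers with maf >= min_maf and missing fraction <= max_missing."""
    n_lines = len(geno)
    keep = []
    for j in range(len(geno[0])):
        calls = [row[j] for row in geno if row[j] is not None]
        if not calls or 1 - len(calls) / n_lines > max_missing:
            continue
        freq = sum(calls) / (2 * len(calls))  # frequency of the counted allele
        if min(freq, 1 - freq) >= min_maf:
            keep.append(j)
    return keep


def vanraden_matrix(geno: Genotypes) -> List[List[float]]:
    """G = ZZ' / (2 * sum(p * (1 - p))), with Z = genotype - 2p.

    Assumes markers were filtered first, so every marker has at least one
    call and is polymorphic. Missing genotypes contribute zero to Z.
    """
    p = []
    for j in range(len(geno[0])):
        calls = [row[j] for row in geno if row[j] is not None]
        p.append(sum(calls) / (2 * len(calls)))
    denom = 2 * sum(pj * (1 - pj) for pj in p)
    z = [[0.0 if g is None else g - 2 * pj for g, pj in zip(row, p)]
         for row in geno]
    return [[sum(a * b for a, b in zip(zi, zk)) / denom for zk in z] for zi in z]


def similarity_edges(matrix: List[List[float]], names: List[str],
                     threshold: float = 0.3) -> List[Tuple[str, str, float]]:
    """Edge list (source, target, weight) for pairs with similarity > threshold."""
    edges = []
    for i in range(len(names)):
        for k in range(i + 1, len(names)):
            if matrix[i][k] > threshold:
                edges.append((names[i], names[k], matrix[i][k]))
    return edges
```

The resulting edge list is the kind of input a force-directed layout works from: only pairs above the threshold attract each other, so raising the threshold isolates weakly connected lines.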
3.3.3 Hierarchical clustering
The Panzea public dataset was filtered for lines that had pedigree information, resulting
in 2,554 lines. Then, 142 Guelph lines were added to this dataset, for a total of 2,696 lines. This
dataset contains popcorn, sweet corn, tropical, U.S. public and ex-PVP germplasm, and lines
derived in Guelph. Markers with maf ≥ 0.05 and maximum 90% missing data were retained,
using VCFtools v0.1.12b, with 288,878 SNPs passing filtering (Danecek et al. 2011). SNPs were
re-coded to additive components, using PLINK (Purcell et al. 2007). An IBS matrix was
calculated using the VanRaden subroutine (VanRaden 2008). The IBS matrix was scaled and
centered in R 3.1.3 (R Core Team 2015). A distance matrix was computed in R using the
Euclidean distance. Hierarchical clustering was performed on the distance matrix using
Ward’s minimum variance method and a dendrogram was created.
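The scaling, distance and clustering steps can be illustrated with a small pure-Python sketch. It is not the R code used for the analysis: the sketch applies the Lance-Williams update to squared Euclidean distances, which is one common formulation of Ward's criterion, and which variant was used in R is not stated here, so that choice is an assumption. Function names are hypothetical, and zero-variance columns are set to 0 rather than left undefined.

```python
import math
from typing import Dict, List, Tuple


def scale_columns(matrix: List[List[float]]) -> List[List[float]]:
    """Centre each column on its mean and divide by its sample SD (n - 1)."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
            for row in matrix]


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ward_cluster(points: List[List[float]]) -> List[Tuple[List[int], List[int], float]]:
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters: Dict[int, List[int]] = {i: [i] for i in range(len(points))}
    # squared distances between current clusters, keyed by (smaller id, larger id)
    d2 = {(i, j): euclidean(points[i], points[j]) ** 2
          for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        (i, j), dij = min(d2.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dki = d2[(min(k, i), max(k, i))]
            dkj = d2[(min(k, j), max(k, j))]
            # Lance-Williams update for Ward's method on squared distances
            updated[(k, next_id)] = ((ni + nk) * dki + (nj + nk) * dkj
                                     - nk * dij) / (ni + nj + nk)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), math.sqrt(dij)))
        merged = clusters.pop(i) + clusters.pop(j)
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        d2.update(updated)
        clusters[next_id] = merged
        next_id += 1
    return merges
```

In use, the rows of the scaled IBS matrix would be passed to `ward_cluster`, and the ordered merges define the dendrogram. The quadratic search for the closest pair is adequate for a toy example but not for thousands of lines.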
The clustering method to identify putative parents was identical to the clustering method
described above. Here, clustering was performed using one Guelph line at a time, or one set of
siblings at a time, with 2,570 other lines (2,554 lines from the Panzea dataset plus 16 ex-PVP
lines). For each Guelph line assessed, a new dataset was created, a new IBS matrix was
calculated and the clustering was performed. To assess the similarity between the line of interest
and the neighbours in the clustering output, the percentage of matching SNP alleles was
determined. To do this, a custom script was used that ignored missing loci and calculated the number of
loci where the SNP alleles were identical between a line and the proposed relative. To
cluster a line with the second (or third) parent, SNP locations where the genotype of the line
matched the genotype of the proposed parent #1 were removed from the dataset. With this subset
of SNPs, the IBS matrix was recalculated and the clustering procedure repeated. To evaluate the
similarity between lines and putative parents identified through clustering, the percentage of
alleles unaccounted for by the parents was determined. Using 0 for major allele and 1 for minor
allele, the following scenarios were deemed as “unaccounted for”: (1) offspring was 0/0 and all
parents were 1/1; (2) offspring was 1/1 and all parents were 0/0; and (3) offspring was
heterozygous and all parents were 0/0 or all parents were 1/1. The percentage of alleles
unaccounted for was then calculated by dividing the number of unaccounted for loci by the
number of loci with non-missing data for all the lines. The percentage of unaccounted for alleles
in randomly generated parent and offspring combinations, from the dataset of 2,696 lines, was
determined by using a random number generator to select a line, parent #1 and parent #2 from
the dataset, then calculating the percentage of alleles unaccounted for between the chosen line
and parents, using the method described above. A histogram of the proportion of unaccounted for
alleles for 10,000 randomly generated pedigrees was generated using a spreadsheet.
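The two similarity measures described above can be sketched as follows. This is an illustration of the stated rules, not the custom script itself: genotypes are assumed to be coded as minor-allele counts (0 = 0/0, 1 = heterozygous, 2 = 1/1, None = missing), "matching" is assumed to mean an identical genotype at a locus, and the function names are hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0 = 0/0, 1 = heterozygous, 2 = 1/1, None = missing


def percent_matching(line: Sequence[Genotype], relative: Sequence[Genotype]) -> float:
    """Percentage of loci with identical genotypes, ignoring loci missing in either line."""
    pairs = [(a, b) for a, b in zip(line, relative)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no loci with data in both lines")
    return 100 * sum(a == b for a, b in pairs) / len(pairs)


def is_unaccounted(offspring: int, parents: Sequence[int]) -> bool:
    """Apply the three 'unaccounted for' scenarios to a single locus."""
    all_major = all(p == 0 for p in parents)
    all_minor = all(p == 2 for p in parents)
    if offspring == 0:
        return all_minor          # scenario 1
    if offspring == 2:
        return all_major          # scenario 2
    return all_major or all_minor  # scenario 3: heterozygous offspring


def percent_unaccounted(offspring: Sequence[Genotype],
                        parents: Sequence[Sequence[Genotype]]) -> float:
    """Unaccounted-for loci as a percentage of loci with data in all lines."""
    n_loci = n_unaccounted = 0
    for locus in zip(offspring, *parents):
        if any(g is None for g in locus):
            continue
        n_loci += 1
        n_unaccounted += is_unaccounted(locus[0], locus[1:])
    if n_loci == 0:
        raise ValueError("no loci with data in all lines")
    return 100 * n_unaccounted / n_loci
```

Passing two or three parent genotype vectors to `percent_unaccounted` covers both the two-way and three-way pedigrees.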
3.3.4 Identity by descent
There were two unique sets of sibling inbred lines. The first set included seven siblings
derived from a Stiff Stalk by Iodent two-way inter-heterotic breeding cross. All seven inbred
lines behaved as Iodent inbred lines when crossed to inbred lines of defined heterotic patterns.
The second set contained three siblings that behaved as Stiff Stalk inbred lines when crossed to
inbred lines of defined heterotic patterns. This set was derived from the following three-way
breeding cross: [CG102 x (CG33 x CG65)]. CG102 and CG65 belong to the Stiff Stalk heterotic
pattern, while CG33 belongs to the WF9/Mn13 family of the Lancaster heterotic pattern. For
each inbred, SNP locations were filtered for no missing data between the line and the parents,
using VCFtools (Danecek et al. 2011). A custom script then identified SNPs where the offspring
was homozygous and the allele was only present in one of the parents. Ideograms of parental
SNP alleles were created with PhenoGram (Wolfe et al. 2013) using chromosome lengths and
centromere positions obtained from B73 RefGen_v2 from Maize GDB (Andorf et al. 2010). For
the three-way cross inbred lines, the number of SNP alleles inherited from each parent was tested
for deviation from the expected 2:1:1 ratio using a chi-square test.
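The assignment of SNP alleles to a single parent and the test against the 2:1:1 ratio can be sketched as below. The sketch assumes complete data coded as minor-allele counts and treats a parent as carrying the offspring allele unless it is homozygous for the other allele. Function names and any counts used with them are hypothetical. The p-value helper is valid only for two degrees of freedom (three parental classes), where the chi-square survival function is exp(−x/2).

```python
import math
from typing import Dict, List, Optional, Sequence


def parental_origin(offspring: Sequence[int],
                    parents: Dict[str, Sequence[int]]) -> List[Optional[str]]:
    """For each SNP, name the only parent carrying the offspring allele, else None."""
    origins: List[Optional[str]] = []
    for i, g in enumerate(offspring):
        origin = None
        if g in (0, 2):  # offspring is homozygous
            # a parent lacks the allele only if it is the opposite homozygote
            carriers = [name for name, geno in parents.items() if geno[i] != 2 - g]
            if len(carriers) == 1:
                origin = carriers[0]
        origins.append(origin)
    return origins


def chi_square_ratio(observed: Dict[str, int], ratio: Dict[str, float]) -> float:
    """Chi-square statistic for observed counts against an expected ratio."""
    total = sum(observed.values())
    ratio_sum = sum(ratio.values())
    stat = 0.0
    for name, count in observed.items():
        expected = total * ratio[name] / ratio_sum
        stat += (count - expected) ** 2 / expected
    return stat


def p_value_df2(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with exactly 2 df."""
    return math.exp(-stat / 2)
```

For the three-way cross, the ratio would be supplied as `{"CG102": 2, "CG33": 1, "CG65": 1}`, with the observed counts taken from the parental assignments.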
Parental blocks in each individual were determined using a custom script that identified
regions where a minimum of two consecutive SNP alleles originated from the same parent. The
length of the block was determined by subtracting the position of the beginning SNP from the
position of the ending SNP in the block. The average number and size of the blocks were
calculated using a spreadsheet. A histogram of the sizes of parental linkage blocks for each
inbred line was generated using a spreadsheet.
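The block-calling rule can be expressed compactly. The sketch below assumes that the input holds only the informative SNPs of one chromosome, sorted by position, each already assigned to a parent; the function names and the tuple layout are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Block = Tuple[str, int, int, int]  # (parent, start position, end position, length)


def parental_blocks(snps: Sequence[Tuple[int, str]], min_snps: int = 2) -> List[Block]:
    """Call blocks of at least min_snps consecutive SNPs from the same parent.

    snps: (position, parent) pairs for one chromosome, sorted by position.
    Block length is the end position minus the start position.
    """
    blocks = []
    for parent, run in groupby(snps, key=lambda snp: snp[1]):
        positions = [pos for pos, _ in run]
        if len(positions) >= min_snps:
            start, end = positions[0], positions[-1]
            blocks.append((parent, start, end, end - start))
    return blocks


def mean_block_length(blocks: Sequence[Block]) -> float:
    """Average block length, as summarised per inbred line."""
    return sum(block[3] for block in blocks) / len(blocks)
```

A single SNP flanked by alleles from another parent ends the current block and forms no block of its own, so the minimum of two consecutive SNPs directly controls how sensitive the block calls are to isolated genotyping errors.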
To determine IBD regions shared between siblings within a dataset, each dataset was first
filtered for no missing data in any of the lines or parents. A custom script was then used to
identify SNPs that were polymorphic for the parents and had identical alleles between all the
siblings. An ideogram of alleles common to all siblings in the dataset was generated using
PhenoGram (Wolfe et al. 2013). The shared SNPs between the seven siblings of the Stiff Stalk
by Iodent cross were visually inspected to identify regions of dense marker coverage, which
were then indicated on the ideogram.
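The selection of SNPs shared by all siblings can be sketched as follows, assuming complete data coded as minor-allele counts. "Polymorphic for the parents" is interpreted here as the parents not all carrying the same genotype, which is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def shared_sibling_snps(siblings: Sequence[Sequence[int]],
                        parents: Sequence[Sequence[int]]) -> List[int]:
    """Indices of SNPs that differ between parents but are identical in all siblings."""
    shared = []
    for i in range(len(parents[0])):
        parent_calls = {parent[i] for parent in parents}
        sibling_calls = {sibling[i] for sibling in siblings}
        if len(parent_calls) > 1 and len(sibling_calls) == 1:
            shared.append(i)
    return shared
```

SNPs that are monomorphic in the parents are excluded because they carry no information about which parent contributed a segment.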
3.4 RESULTS AND DISCUSSION
3.4.1. Classifying Lines into heterotic patterns using GBS data
The number of lines that could be analyzed simultaneously with the network diagram
was limited due to the memory and processing power required to run the software. For this
network diagram experiment, the sample size was limited to approximately 1,200 lines. Because
this experiment aimed to generate nodes reflecting U.S. heterotic patterns, only U.S. field corn
inbred lines and Guelph inbred lines were selected. The selected set of inbred lines included 110
Guelph lines and 1,081 U.S. public and ex-PVP field corn lines from the Panzea data set, which
were deposited by Romay et al. (2013).
The network diagram generated from 1,191 lines resolved into three primary nodes which
correspond to the three main heterotic pools: Stiff Stalk, Lancaster/Non-Stiff Stalk and Iodent
(Figure 3.1). Within the Stiff Stalk and Lancaster main nodes there was further grouping of lines
into founder backgrounds. In the Lancaster node these groups corresponded to Oh43, Mo17 and
Wf9. In the Stiff Stalk node, while there was less definition between the groups, the groups
corresponded to B14, B73 and B37. Besides the three primary nodes, there were several other
minor groups of inbred lines. These smaller groups appear to be based on shared geographical
origin and did not fit into the three main nodes. The first is a grouping of Nebraska lines to the
left of the Iodent node (Figure 3.1). The second is a grouping of predominantly North Carolina
lines to the right of the Iodent cluster. There are also many small clusters between the Lancaster
and Stiff Stalk groups, which tend toward Lancaster. The population structure of these lines is
listed as “unclassified” in the file of ex-PVP lines deposited by Romay et al. (2013). The
coloured grouping to the right of the Iodent node is composed of Guelph high-carotenoid (HiC)
inbred lines (Figure 3.1). The HiC inbred lines were derived from breeding crosses involving
populations belonging to the Argentine Orange flints (Burt et al. 2011). This method was first
used on GBS data by Romay et al. (2013) on a set of 212 ex-PVP inbred lines which also
resulted in generating three main nodes corresponding to Stiff Stalk, non-Stiff Stalk and Iodent
heterotic patterns. And similar to Romay et al. (2013), lines with shared pedigrees show close
proximity in the diagram. For the Guelph inbred lines, the assignment to a primary node
corresponds to the heterotic patterns of the inbred lines.
Hierarchical clustering was performed using the full set of 2,696 inbred lines, including
all of the lines present in the network diagram, as well as popcorn, sweet corn, and tropical
germplasm. The dendrogram showed separation of tropical germplasm (Nigeria, Mexico,
Thailand and North Carolina) from the rest of the lines, and also showed separation of sweet
corn and popcorn from the U.S. field corn (Figure 3.2, Appendix Figure 1). The separation of
these groups into distinct clusters is consistent with what was observed by Hansey et al. (2011,
2012). Within U.S. field corn germplasm, there were further divisions between Stiff Stalk
germplasm (including the founder lines B73, B37 and B14) and non-Stiff Stalk germplasm, but
no clear division of Lancaster and Iodent within the non-Stiff Stalk cluster. The lack of clear
distinction between non-Stiff Stalk subgroups was also observed by Hansey et al. (2011, 2012).
Figure 3.1. Network diagram for 1,081 public and ex-PVP lines and 110 Guelph lines generated
using IBS values greater than 0.3 (A). Points sharing the same colour (red, green, pink or dark
blue) indicate Guelph inbred lines derived from the same pedigree. Other Guelph lines are
coloured light blue. Founder inbred lines of heterotic patterns are labelled. The germplasm
clusters into three main heterotic pools: Stiff Stalk (B73, B37, B14), Iodent (PH207) and
Lancaster (Mo17, Oh43, Wf9). Network diagram, with labels, zoomed in to the region
containing B73 (circled) showing the level of detail generated in the diagram (B).
Figure 3.2. Visual representation of the main division of branches in the dendrogram. The
branches containing tropical, popcorn, and sweet corn are labelled. The remaining branches of
the dendrogram contain U.S. field corn and Guelph lines, with the locations of key founder
individuals indicated.
Hierarchical clustering using Ward’s method was largely influenced by family structures
in the dataset, which complicates the interpretation of results and reduces the usefulness of this
approach. Using known family structures in the dataset, it was observed that sibling groups
derived from inter-heterotic crosses had an effect on the clustering of the parent lines. A group of
Guelph siblings were derived from PHJ40 x PHR25, an inter-heterotic cross of Stiff Stalk x
Iodent. When the dataset did not include any Guelph lines, the branches containing PHJ40 and
PHR25 diverged at the third split in the dendrogram and did not cluster close together, consistent
with their belonging to different heterotic groups (Fig. 3.3A, B). When all the Guelph lines were
included in the dataset, PHJ40 and PHR25 clustered together, with their Guelph progeny (Fig.
3.3C). This example demonstrates the effect of shared offspring from an inter-heterotic cross on
the clustering output. The influence of large numbers of progeny from an inter-heterotic cross
was also observed in the network diagram. The Stiff Stalk parent, PHJ40, was brought to the
periphery of the Iodent node, rather than being in the Stiff Stalk node, when its progeny were
included in the analysis. This phenomenon should be considered when selecting a dataset for the
analysis of heterotic patterns.
Figure 3.3. Subset of the dendrogram showing closest neighbours of PHJ40 (A) and PHR25 (B),
located on branches that diverge in the third split of the dendrogram, when University of Guelph
lines are not included in the dataset. When progeny from PHJ40 x PHR25 are included in the
dataset, these lines cluster together, with their progeny (C).
The network diagram was more effective at assigning lines to heterotic patterns than
hierarchical clustering with Ward’s method. While both methods separated Stiff Stalk from non-
Stiff Stalk lines, only the network diagram generated distinct Lancaster and Iodent nodes within
the non-Stiff Stalk group. Another advantage of the network diagram was the stability of the
main nodes that were generated, corresponding to the Stiff Stalk, Iodent and Lancaster heterotic
groups. In contrast, multiple creations of the network diagram revealed that germplasm with
weak connections to the main nodes had low stability, and were placed at arbitrary locations in
the diagram.
The merit of the network diagram is that it allows for multiple connections between
inbred lines, which gives a spatial sense of the strength of the connections between inbred lines.
With hierarchical clustering, the strength of the connection between inbred lines is not apparent
because every line is placed in the dendrogram, regardless of the extent of genetic similarity. In
the network diagram, lines with weak connections to any of the established groups can be easily
identified, as they are placed between the main nodes that form, or are placed in the periphery of
the diagram. The lines that fall between the main nodes are likely generated from a mix of
heterotic patterns and thus do not share strong similarity with either heterotic pattern. Within the
main nodes, the network diagram allows for the generation of sub-nodes, where the lines in the
center of the node have the strongest connection to other members of that node.
The previously described network diagram by Romay et al. (2013) has nicely defined
nodes with few points falling outside of the main nodes. In this study, the larger sample size of
nearly 1,200 lines, compared to 212 lines used by Romay et al. (2013), likely accounts for the
many small nodes falling outside of the main heterotic groups. The formation of clean nodes is
dependent on the germplasm used and the selection of lines with strong connections to other
inbred lines in the dataset. This research found that the network diagram is more informative
than hierarchical clustering with Ward’s method for identifying heterotic patterns in U.S.
maize.
3.4.2. Identification of putative parents of lines of unknown parentage using GBS data
One of the weaknesses of the hierarchical clustering approach, however, is also an attribute
that can be exploited to discover the parentage of lines. Putative parents were identified using
hierarchical clustering of the dataset and identifying lines that clustered near the line of interest.
To test the usefulness of this clustering approach, the clustering output for a Guelph line with a
known pedigree, CG118, was investigated. The pedigree for CG118 is: CG102 x (CG33 x
CG65). The initial clustering of CG118 placed it with CG120 (sibling), CG119 (sibling), CG102
(parent) and CG114 (half-sibling of CG118). After removing SNPs that were shared between
CG118 and CG102, CG118 clustered with CG33 (second parent), CG34 (sibling of CG33),
CG106 (descendant of CG33) and CG24 (sibling of CG33). Removing SNPs that were shared
between CG118 and CG33, resulted in CG118 clustering with CG65 (third parent) and CG57
(sibling of CG65). This example demonstrates that if the parents are in the dataset, the line will
cluster with its parents as well as close relatives, and that removing SNPs matching the first
parent will lead to clustering with the second (and third) parent.
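The SNP-removal step used between successive rounds of clustering can be sketched as below. The sketch assumes genotypes coded as minor-allele counts with None for missing data, and it keeps loci where either genotype is missing, since these cannot be shown to match; that handling, the function names and the line names in any toy data are assumptions.

```python
from typing import Dict, List, Optional, Sequence

Genotype = Optional[int]


def snps_not_matching(line: Sequence[Genotype], parent: Sequence[Genotype]) -> List[int]:
    """Indices of SNPs where the line does not match the proposed parent."""
    return [i for i, (a, b) in enumerate(zip(line, parent))
            if a is None or b is None or a != b]


def subset_markers(dataset: Dict[str, Sequence[Genotype]],
                   keep: Sequence[int]) -> Dict[str, List[Genotype]]:
    """Restrict every line in the dataset to the retained SNP indices."""
    return {name: [geno[i] for i in keep] for name, geno in dataset.items()}
```

The reduced dataset is then used to recalculate the similarity matrix and repeat the clustering, so the remaining SNPs are those that the first parent cannot explain.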
Using the above example, what is the best strategy for identifying the parental line from a
group of lines within a cluster? The hypothesis that the progeny will share the greatest
percentage of SNP alleles with the parent, over other relatives, was tested. The percentage of
SNP alleles shared between a line with known pedigree information, CG118, and clustering
neighbours was determined. For the initial clustering, the percentage matching between CG118
and its neighbours was: 73.16% with CG102 (parent), 67.18% with CG114 (half-sibling), 67.05%
with CG120 (sibling), and 57.35% with CG119 (sibling). After removing the CG102 alleles
and repeating the clustering, CG118 shared the greatest percentage of SNPs with
CG33 (parent), compared to CG34, CG106 and CG24. The lines CG119 and CG120, also
derived from CG102 x (CG33 x CG65), were then assessed for percentage matching with these
same neighbours and both had the greatest percentage matching with CG102. For the second
clustering, CG120 shared the greatest percentage of SNP alleles with CG33, and CG119 shared
the most SNP alleles with CG106, a descendant of CG33. From this example, using known
family structures, it is clear that the progeny does not always share the greatest percentage of
alleles with the parent, over another relative. Assessing the percentage of shared alleles alone is
insufficient to distinguish parents from close relatives. In general, an inbred shares ~50% of
alleles with a parent, but also shares ~50% of alleles with full siblings. In addition, the genetic
relationships between maize lines are complicated because there is extensive inbreeding and
selection, which generates IBS values between relatives that will differ from what would be
found in livestock pedigrees, for example. For datasets with complex pedigree structures and
many relatives present, such as the example described, clustering appears to group the line with
its closest relatives but pedigree information is needed to sort out the nature of the relationships
between the lines.
This approach of using GBS data to identify putative parents via hierarchical clustering,
removing alleles shared with the parent and rerunning the analysis is subject to several sources of
error that could lead to alleles unaccounted for by the parents. The hypothesis that 100% of the
alleles in an inbred will be accounted for by the correct parents was tested by using 30 inbred
lines with known pedigrees (Table 3.2). The percentage of SNP alleles not accounted for by the
proposed parents ranged from 0.20% to 6.16%. These percentages may be higher than the
expected 0% due to the error rate of SNP calling in GBS data, which is reported at an average of
0.18% (Romay et al. 2013). Another possibility is that the seed source used for the GBS
sequencing of the publicly available data may differ from the seed source used in the
Guelph breeding program, so there may be small genetic differences between these lines.
Finally, pedigree records may be incorrect, or mutation could generate novel SNPs. While the
percentage of alleles not accounted for by the correct parents can be as high as 6%, a low
percentage can also be obtained from an incorrect pedigree. For example, the correct lineage of
CG118, which is CG102, CG33 and CG65, has a percentage of unaccounted for alleles of 0.30%.
Substituting a full sibling, CG34, for CG33 gives a percentage of unaccounted for alleles of 0.79%.
This method is unable to confirm a correct pedigree, but can identify putative pedigrees that are a
poor fit, with a high percentage of alleles unaccounted for by the parents. To give context for the
genetic similarity of lines and putative parents, 10,000 randomly generated pedigrees (a line and two
parents randomly selected from the dataset) were assessed. The proportion of alleles unaccounted for
between randomly selected lines and parents ranged from 0% to 37.50%, with a median of
14.89% (Figure 3.4). In this sample of 10,000 randomly generated pedigrees, 99% of the
pedigrees had more than 6.41% of alleles unaccounted for by the parents, so obtaining a percentage
of 6.41% or less is unlikely to occur from a randomly generated pedigree.
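The random-pedigree comparison can be sketched as a small simulation. This is an illustration only: it assumes complete genotype data coded as minor-allele counts, uses Python's random module in place of the random number generator and spreadsheet used in the study, and applies a nearest-rank percentile. All names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def percent_unaccounted(offspring: Sequence[int],
                        parents: Sequence[Sequence[int]]) -> float:
    """Percentage of loci where no parent can explain the offspring genotype."""
    n_bad = 0
    for locus in zip(offspring, *parents):
        child, rest = locus[0], locus[1:]
        all_major = all(p == 0 for p in rest)
        all_minor = all(p == 2 for p in rest)
        if ((child == 0 and all_minor) or (child == 2 and all_major)
                or (child == 1 and (all_major or all_minor))):
            n_bad += 1
    return 100 * n_bad / len(offspring)


def random_pedigree_distribution(dataset: Dict[str, Sequence[int]],
                                 n_draws: int, seed: int = 1) -> List[float]:
    """Draw a line and two parents at random and score each random pedigree."""
    rng = random.Random(seed)
    names = sorted(dataset)
    values = []
    for _ in range(n_draws):
        line, parent1, parent2 = rng.sample(names, 3)
        values.append(percent_unaccounted(dataset[line],
                                          [dataset[parent1], dataset[parent2]]))
    return values


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

The value below which only 1% of random pedigrees fall, `percentile(values, 1)`, gives a cut-off for judging whether a proposed pedigree fits better than chance.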
Table 3.2. The percentage of SNPs unaccounted for by either parent for inbred lines of known
parentage.
Inbred      Parent 1  Parent 2  Parent 3  Total SNPs  Number of alleles unaccounted for  Percent
CG108       PHJ40     PHR25     -         118,485     234                                0.20%
CG80        PHG29     PHG47     -         227,790     531                                0.23%
PHKE6       PHG29     PHG47     -         138,473     345                                0.25%
PHP85       PHK29     PHW52     -         247,111     638                                0.26%
PHEG9       PHG86     PHW52     -         127,410     358                                0.28%
CG118       CG102     CG33      CG65      173,142     512                                0.30%
PHPR5       PHK76     PHW52     -         226,937     754                                0.33%
CG119       CG102     CG33      CG65      116,816     393                                0.34%
(NK)S8326   W117      Mo17      Mo17      221,812     858                                0.39%
PHJ89       PHG47     PHT77     -         226,836     983                                0.43%
PHW86       PHG71     PHG72     -         236,194     1,024                              0.43%
(NK)W8555   B73       B84       -         247,588     1,252                              0.51%
PHK56       PHG47     PHG35     -         90,879      463                                0.51%
CG120       CG102     CG33      CG65      171,999     914                                0.53%
LH74        A632      B73       -         251,461     1,340                              0.53%
PHW80       PHK76     PHN37     -         133,451     743                                0.56%
PHBA6       PHG47     PHZ51     -         226,971     1,350                              0.59%
(NK)W8304   B14       B73       B73       256,948     1,554                              0.60%
CG123       CGR01     CG110     -         71,101      460                                0.65%
(NK)778     W117      B37       -         218,080     1,418                              0.65%
(NK)807     W117      B37       -         215,138     1,443                              0.67%
PHRE1       PHJ40     PHR47     -         140,547     1,081                              0.77%
LH216       LH51      LH123     LH51      114,893     1,020                              0.89%
CG122       CG60      CG102     -         75,149      890                                1.18%
CG26        F2        F7        -         169,699     2,206                              1.30%
LH214       LH123     LH51      -         236,921     5,066                              2.14%
LH213       LH123     LH51      -         235,589     5,418                              2.30%
CG121       CG60      CG102     -         77,335      2,122                              2.74%
PHJ90       PHG50     PHK42     -         199,376     9,163                              4.60%
PHW53       PHG50     PHZ51     -         142,058     8,747                              6.16%
Figure 3.4. Histogram of the proportion of SNP alleles unaccounted for by either parent when
inbred lines and parents are randomly sampled from the dataset (n=10,000).
After using examples of lines with known pedigrees, this clustering method was applied to
lines from the Guelph breeding program which were created by self-pollinating commercial
single-cross hybrids. This was based on the assumption that one or both of the parents of the
single-cross hybrids were present in the dataset of ex-PVP inbred lines. For groups of full-
siblings in the dataset, the group of siblings was clustered with the public and ex-PVP lines, and
then each inbred was clustered individually with the public and ex-PVP lines. Comparing the
clustering results of groups of siblings, there were several types of outcomes observed, with the
percentage of alleles unaccounted for being used to assess the genetic closeness of the proposed
pedigree.
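The percentage of alleles unaccounted for can be illustrated with a minimal sketch. It assumes each inbred line and putative parent is represented by one homozygous allele call per SNP, and that SNPs with a missing call in the line or in any parent are skipped; the function name, the missing-data symbol and the toy calls are hypothetical and are not taken from the thesis dataset or its scripts.

```python
def percent_unaccounted(line, parents, missing="N"):
    """Return (total_snps, unaccounted, percent) for one inbred line.

    line    : sequence of allele calls for the inbred, one per SNP
    parents : list of allele-call sequences, one per putative parent
    SNPs with a missing call in the line or in any parent are skipped
    (an assumption made for this sketch).
    """
    total = 0
    unaccounted = 0
    for i, allele in enumerate(line):
        parent_alleles = [p[i] for p in parents]
        if allele == missing or missing in parent_alleles:
            continue
        total += 1
        # An allele is unaccounted for when no putative parent carries it.
        if allele not in parent_alleles:
            unaccounted += 1
    percent = 100.0 * unaccounted / total if total else float("nan")
    return total, unaccounted, percent


# Toy calls (hypothetical, for illustration only)
line = "AACGTTNGCT"
parent_1 = "AACGATCGCA"
parent_2 = "ATCGTTCGGA"
total, unaccounted, percent = percent_unaccounted(line, [parent_1, parent_2])
print(total, unaccounted, round(percent, 2))
```

The same ratio applied to the counts reported for CG108 (234 of 118,485 SNPs) gives approximately 0.20%.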
The first outcome was that each sibling clustered with the same ex-PVP inbred line in the
first and second clustering. For example, a group of five siblings first clustered with PHJ40, then
PHG72 in the second clustering (Figure 3.5). The percentage of alleles unaccounted for by either
of these proposed parents ranged from 0.54% to 0.76% for the five siblings, making it highly
probable that PHJ40 and PHG72 were the correct parents for this group of offspring.
Figure 3.5. Clustering of five siblings places all five lines with PHJ40 as the putative parent (A).
Removing SNPs with identical alleles between the lines and PHJ40 places each line with
PHG72, showing CG64 as an example (B).
The second type of outcome was that some siblings in the group clustered with the same
proposed parents, but one sibling clustered with different lines. In a group of five siblings, four
siblings each clustered with LH85, then PHJ40 (Figure 3.6). The percentage of alleles
unaccounted for by either of the proposed parents ranged from 3.00% to 3.96% for these four
lines. The fifth sibling, CG42, did not cluster with the rest of the siblings (Figure 3.7), suggesting
that perhaps the pedigree records of CG42 are incorrect and it is not a sibling. In another
example, seven of the eight siblings clustered with PHJ40, then PHR25, with the percentage of
unaccounted for alleles ranging from 0.43% to 0.64%. The eighth sibling, CG60, had a much
higher percent of unaccounted for alleles, 6.34% for PHJ40 x PHR25. Initial clustering placed
CG60 with PHJ40 but the second clustering did not place CG60 with any ex-PVP lines. In the
case of CG60, it is most likely a progeny from PHJ40 x PHR25, as previous SSR marker work
indicated a high degree of IBD with the other siblings (Lee et al. 2006, 2007).
Figure 3.6. Clustering of five siblings places four of the siblings together with LH85 as the
putative parent (A). For these four siblings, removing SNPs with identical alleles to LH85 places
each line with PHJ40, showing CG70 as an example (B). The fifth sibling, CG42, clusters
elsewhere (Figure 3.7).
Figure 3.7. Initial clustering places CG42 with Tx714 and N201 (A). Removing SNPs with
identical alleles to N201 (B) or Tx714 (C) places CG42 with NC278 as the second putative
parent.
Another outcome is that lines cluster with more than one close relative. Initial clustering
for CG42 placed it with N201 and Tx714 (Figure 3.7). Tx714 and N201 are genetically similar
and cluster together when CG42 is not in the dataset. Selecting either line as the first parent
results in clustering with NC278 as the second parent. The percentage of alleles unaccounted for
by proposed parents is 0.69% for Tx714 x NC278, and 0.76% for N201 x NC278. It appears that
both proposed pedigrees are a good fit, and the correct lineage could not be determined without
pedigree records.
The final outcome is that there was no consensus within a group of siblings as to which
inbred lines were the putative parents. This was observed for a group of five siblings (CG37,
CG38, CG57, CG65, and CG79) when lines were clustered individually (Figure 3.8). Two of the
siblings clustered with SD102 with differing surrounding ex-PVP lines. The other three siblings
had identical clustering results, all clustering with another group of ex-PVP lines. It is speculated
that the actual parents of these lines are perhaps not in the dataset. It may be useful to cluster
lines with a different set of germplasm in hopes of obtaining consistent clustering with one ex-
PVP line for all five siblings. While GBS data can be used to verify pedigree records and, in
some cases, identify putative parents of lines derived from unknown inbred lines, the major
limitation with this technique is that parents cannot be discerned from close relatives. However,
knowing the exact parentage may not be necessary, as identifying germplasm that is similar to
the line of interest is still beneficial to a plant breeder.
Figure 3.8. Clustering of five siblings did not reveal a clear putative parent. When each sibling
was clustered individually, two of the siblings, CG57 (A) and CG65 (B), clustered with SD102,
with differing surrounding ex-PVP lines. The other three siblings, CG37 (C), CG38 (D) and
CG79 (E), clustered with the same set of ex-PVP lines, which differs from the clustering in (A)
and (B).
3.4.3. Determining genome contribution of the parents from a breeding cross using GBS
data
The proportion of alleles derived from each parent was determined for the two sets of
Guelph siblings: set #1 from a Stiff Stalk x Iodent breeding cross and set #2 from a three-way
cross involving two Stiff Stalk parents and one Lancaster parent. The seven siblings from set #1
were previously classified as Iodent inbred lines, therefore an overrepresentation of Iodent alleles
was expected. The three siblings from set #2 were classified as Stiff Stalk inbred lines, and the
expectation for this group was an overrepresentation of Stiff Stalk alleles. These two
expectations were not consistently met. For example, the proportion of SNP alleles inherited
from the Iodent parent in set #1 ranged from 38.76% to 68.87% (Figure 3.9). For each inbred
line in the three-way cross dataset, the number of SNP alleles derived from each parent was
significantly different from expectations (2:1:1) (p < 0.001 for all tests of line vs. expectation),
although all lines share a majority of alleles with CG102, as expected (Figures 3.10 and 3.11).
These three siblings behave as Stiff Stalks, so a minimization of alleles from the Lancaster
parent, CG33, was expected. Instead, two of the inbreds had greater than the expected 25%
contribution of CG33. Surprisingly, these results suggest that in an inter-heterotic cross,
minimizing alleles from one heterotic pattern is not required for the line to behave as the other
heterotic pattern, based on the markers used in this study.
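A comparison of observed allele counts against the 2:1:1 expectation can be sketched as a chi-square goodness-of-fit test. The source does not name the test that was used, so this choice is an assumption, and the counts below are hypothetical values shaped loosely like the CG118 proportions rather than data from the thesis. With three categories there are 2 degrees of freedom, for which the chi-square upper-tail probability has the closed form exp(-x/2).

```python
import math


def chi_square_gof(observed, expected_ratio):
    """Chi-square goodness-of-fit statistic and degrees of freedom."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


def p_value_df2(stat):
    """Upper-tail probability of a chi-square variable with 2 df."""
    return math.exp(-stat / 2.0)


# Hypothetical allele counts assigned to CG102, CG33 and CG65
observed = [4400, 3900, 1700]
stat, df = chi_square_gof(observed, [2, 1, 1])
print(stat, df, p_value_df2(stat) < 0.001)
```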
Panels: CG62 (68.9%), CG108 (54.7%), CG104 (53.9%), CG40 (51.9%), CG41 (50.0%), CG105 (47.3%),
CG61 (38.8%)
Figure 3.9. Parental linkage block distribution across the seven inbreds derived from the two-
way breeding cross, PHJ40 (Stiff Stalk, blue) and PHR25 (Iodent, red). Of the SNP alleles that
could be assigned to a parent, the percentage of SNP alleles inherited from the Iodent parent is
shown.
Panels: CG118, CG119, CG120
Figure 3.10. Parental linkage block distribution across the three inbreds derived from the three-
way breeding cross. Linkage blocks in blue were inherited from CG102, red from
CG33, and green from CG65.
Figure 3.11. The expected and observed percentage of parental contribution in the 3-way cross
derived inbred lines based on SNP alleles that could be assigned to a parent. Parental lines:
CG102 (blue), CG65 (green) and CG33 (red). The number of markers used for each line in this
analysis are as follows: 26,467 for CG118, 16,987 for CG119, and 26,668 for CG120.
Inbred       CG102  CG33  CG65
CG118        44%    39%   17%
CG119        63%    31%   6%
CG120        49%    22%   29%
Expectation  50%    25%   25%
While all the siblings in set #1 were derived from the same commercial hybrid, the inbred
CG108 was derived through modified full-sib mating, in contrast to the other siblings, which were
derived through the traditional self-pollination approach (Lee and Kannenberg 2004). The loss of
heterozygosity, in theory, is more gradual in the full-sib mating method compared to the rapid
inbreeding of the self-pollination method. Recombination events are only detectable in the inbred
lines if the recombination occurred in a region with heterozygosity. Because heterozygosity
persists longer in the full-sib mating method, it was expected that CG108 would have more
detectable recombination events, resulting in a larger number of small parental linkage blocks
than the other siblings. In contrast to this expectation, the number of linkage blocks in CG108
was the lowest among all the siblings (Table 3.3), and CG108 had proportionally fewer small
linkage blocks than the other lines (Figure 3.12). The linkage blocks in CG108 were fewer and
larger than the blocks of the other siblings, suggesting that CG108 had much lower rates of
recombination in heterozygous regions than the other siblings.
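One way to picture how parental linkage blocks could be delimited is as runs of consecutive SNPs assigned to the same parent on the same chromosome. The sketch below assumes the SNPs are already sorted by chromosome and position and that only SNPs assignable to a parent are included; the function name and the toy positions are hypothetical. Block length is taken as end position minus start position, which agrees with the lengths listed in Table 3.4.

```python
from itertools import groupby


def parental_blocks(snps):
    """Collapse sorted (chromosome, position, parent) tuples into blocks.

    A block is a run of consecutive SNPs on one chromosome that are all
    assigned to the same parent.
    """
    blocks = []
    for (chrom, parent), run in groupby(snps, key=lambda s: (s[0], s[2])):
        run = list(run)
        start, end = run[0][1], run[-1][1]
        blocks.append({
            "chromosome": chrom,
            "parent": parent,
            "start": start,
            "end": end,
            "n_snps": len(run),
            "length_bp": end - start,
        })
    return blocks


# Toy data (hypothetical positions)
snps = [
    (1, 100, "PHJ40"), (1, 250, "PHJ40"),
    (1, 900, "PHR25"), (1, 1200, "PHR25"), (1, 1500, "PHR25"),
    (2, 50, "PHR25"), (2, 400, "PHJ40"),
]
for block in parental_blocks(snps):
    print(block)
```

Fewer, larger blocks, as reported for CG108, correspond to fewer switches of parental origin along each chromosome.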
Table 3.3. Number and size of parental linkage blocks identified in 10 University of Guelph
inbred lines derived from breeding crosses.
                   Size of parental linkage blocks
         Number of SNPs       Length (base pairs)          Number of
Inbred   max      average     max            average       blocks
Set #1
CG40     2,625    70.88       122,320,764    1,960,463     1,004
CG41     2,041    49.03       98,933,285     1,351,890     1,432
CG61     5,395    54.52       190,407,262    1,528,351     1,287
CG62     1,417    71.36       86,974,542     2,049,483     951
CG104    2,012    51.04       88,093,024     1,397,499     1,383
CG105    3,036    37.35       138,508,477    1,022,729     1,869
CG108    1,472    191.42      121,714,668    10,133,250    196
Set #2
CG118    1,658    61.82       135,930,065    4,246,284     423
CG119    787      94.61       135,425,492    10,140,401    178
CG120    1,316    51.62       127,833,832    3,707,412     510
Figure 3.12. Histogram of sizes of parental linkage blocks, with the frequency of blocks in each
bin expressed as a percentage of the total number of linkage blocks for that inbred line.
Another potential application of GBS data is to identify parental genome segments that
are common across a group of selected progeny. In the Stiff Stalk x Iodent dataset, only 377
SNPs were polymorphic in the parents and shared between all seven siblings (Figure 3.13A).
From visual inspection of the ideogram, five parental genome segments were identified with
dense marker coverage. These regions contained 19 to 140 SNPs and are likely more meaningful,
in terms of selection during breeding, than regions marked by a single SNP (Table 3.4).
Unexpectedly, there were no large linkage blocks from the Iodent parent that were common to all
siblings. This suggests that lines derived from a two-way inter-heterotic cross do not require
large common linkage blocks from the Iodent parent for the inbred line to behave as an Iodent
heterotic pattern. In the three-way cross dataset, 4,289 SNPs were polymorphic for the parents
and shared between the three siblings (Figure 3.13B). The largest regions common across the
siblings were a region of CG102 on most of chromosome 10 and a region of CG33 on
chromosome 1. These shared regions may be due to random chance, rather than selection
pressure, but it is proposed that shared regions between siblings contain genes underlying
behavior as a Stiff Stalk and for desirable agronomic traits such as earliness. These examples
demonstrate that GBS data has applications for determining parental contribution to offspring,
parental genome segments, and shared regions between siblings in a maize breeding program.
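The identification of SNP alleles shared by all siblings can be sketched as a simple intersection. The sketch assumes each sibling has a mapping from SNP identifier to the parent that the allele was assigned to, and that a SNP missing from any sibling is treated as not shared; the function name, SNP identifiers and assignments are hypothetical.

```python
def shared_parental_snps(assignments):
    """Return {snp_id: parent} for SNPs assigned to the same parent in all siblings.

    assignments : dict mapping sibling name -> dict of snp_id -> parent name
    """
    siblings = list(assignments)
    first = assignments[siblings[0]]
    shared = {}
    for snp, parent in first.items():
        # Keep the SNP only if every other sibling has the same assignment.
        if all(assignments[s].get(snp) == parent for s in siblings[1:]):
            shared[snp] = parent
    return shared


# Toy assignments (hypothetical)
assignments = {
    "sib1": {"s1": "PHJ40", "s2": "PHR25", "s3": "PHJ40"},
    "sib2": {"s1": "PHJ40", "s2": "PHJ40", "s3": "PHJ40"},
    "sib3": {"s1": "PHJ40", "s2": "PHR25"},
}
print(shared_parental_snps(assignments))
```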
Figure 3.13. Ideograms of SNP alleles shared by all seven siblings from the Stiff Stalk x Iodent
cross (A) and all three siblings in the 3-way cross (B). Regions in blue represent PHJ40 and red
represent PHR25 (A); regions in blue represent CG102, red CG33, and green CG65 (B). Regions
of dense SNP markers are indicated by green boxes (A).
Table 3.4. IBD regions between the seven siblings of the Stiff Stalk x Iodent cross. These regions
of dense marker coverage were selected based on visual inspection of the ideogram (Figure
3.13).
Chromosome  Start position  End position  Length (bp)  Number of SNPs  Parent
1           280345879       295437051     15,091,172   43              PHJ40
2           232879401       236413006     3,533,605    41              PHR25
4           159297901       167074846     7,776,945    140             PHR25
9           4290930         8037689       3,746,759    19              PHJ40
9           23540249        26831698      3,291,449    28              PHR25
3.5 CONCLUSIONS
This paper explores alternative applications of GBS data to maize inbred lines. Lines can
be assigned to heterotic patterns more effectively using a network diagram approach than the
traditional hierarchical clustering. Hierarchical clustering is effective for identifying putative
parents for lines with unknown parentage or to confirm pedigree records. The percentage of SNP
alleles unaccounted for by either putative parent can be used to gauge whether the proposed
pedigree is a good fit. Finally, GBS data can be used to determine the extent of parental
contribution as well as identify parental genome segments in progeny derived from breeding
crosses.
CHAPTER 4: IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING
STRUCTURED CROSSES
4.1 ABSTRACT
In silico mapping integrates QTL mapping with plant breeding by detecting QTLs using
existing phenotypic data from breeding program trials. North Carolina Design II (NCII) is a
mating design commonly used in maize breeding to assess general and specific combining
ability. The use of in silico mapping in maize was explored using yield data from an NCII
breeding scheme with Stiff Stalk and Iodent commercial caliber material. The 110 hybrids were
evaluated for grain yield at three plant densities in three locations over three years. A mixed
linear model was used to test GBS marker alleles for associations with additive effects for grain
yield in each heterotic pattern. The matrix to describe the relationships between lines was
derived using two different methods: pedigree records and genotyping-by-sequencing data. The
methods produced very similar results, suggesting that marker data can substitute for pedigree
records in generating the relationship matrix. This research identified 123 significant SNP
associations for additive effects for grain yield in Stiff Stalk inbreds located on five
chromosomes. Six of the 12 bins containing QTLs have been reported to contain grain yield
QTLs in previous studies. The SNPs together explain approximately 9.38% of the phenotypic
variance. No significant SNP associations were found for Iodent inbreds, demonstrating the
uniqueness of QTLs to specific heterotic backgrounds. QTLs detected from this approach can be
used for marker assisted selection or genome-wide selection in the breeding program. This paper
demonstrates an in silico mapping mixed model approach to integrate QTL mapping with NCII
using GBS data.
4.2 INTRODUCTION
Yield improvement efforts now incorporate genotypic data as an essential component of
plant breeding programs (Bernardo and Yu 2007). With the decreasing costs of marker data,
genotypic data now plays a large role in breeding through marker-based selection and breeding
value prediction using genome-wide selection. Identification of genomic regions influencing
traits of interest utilizes either a linkage mapping approach or an association mapping approach.
Traditional QTL mapping for grain yield has been performed using linkage mapping, which
utilizes a bi-parental cross to generate a mapping population. The recombination events that
occur from the bi-parental cross and subsequent inbreeding of the F2 lines creates linkage
disequilibrium between the markers and QTLs in this mapping population, facilitating QTL
detection. Previous studies have identified maize grain yield QTLs using linkage mapping with
populations composed of F2 derived lines (Austin and Lee 1996; Ribaut et al. 1997; Malosetti et
al. 2008) and populations derived from crossing the F2:F3 lines, RILs or double haploid lines
with a tester (Melchinger et al. 1998; Ho et al. 2002; Boer et al. 2007).
An alternate approach to linkage mapping is association mapping, also referred to as
linkage disequilibrium mapping or genome wide association study (GWAS). This approach
utilizes historic linkage disequilibrium, arising from historical recombination events, in a
population of related individuals (Zu et al. 2008). This linkage disequilibrium is present in the
entire population analyzed, rather than only being present in an experimentally generated
population. The use of existing phenotypic and genetic data to detect QTLs through
computational methods is referred to as in silico mapping (Grupe et al. 2001).
In silico association mapping can be performed using elite germplasm, which has a
number of advantages over traditional linkage mapping with a bi-parental cross: (1) As the set of
materials that the QTLs are detected in is elite germplasm, the information can be used directly
for marker assisted selection in a breeding program (Parisseaux and Bernardo 2004; Crepieux et
al. 2005); (2) Phenotypes can be generated with environmental replicates, reducing
environmental effects (Zhang et al. 2005); (3) Using existing phenotypic data has reduced cost
compared to generating and assessing phenotypes for a large mapping population (Parisseaux
and Bernardo 2004); and (4) A dataset of elite germplasm has greater potential for QTL
discovery because the parents of the bi-parental cross are likely to be monomorphic for some
markers and will have limited allelic diversity compared to the population as a whole (Parisseaux
and Bernardo 2004; van Eeuwijk et al. 2010; Bink et al. 2012).
This in silico mapping approach can either be applied to an association mapping panel or
to hybrid data from a maize breeding program. An association mapping panel is a collection of
elite inbred maize lines that are genotyped and phenotyped to use in GWAS. An association
mapping panel of 2,279 maize inbred lines from the USDA germplasm collection was used in
conjunction with GBS data for GWAS of growing degree days to 50% silking (Romay et al.
2013). Phenotypic data from breeding programs can also be used as an association mapping
panel. Zhang et al. (2005) mapped QTL for growing degree day heat units to pollen shedding
using 189 microsatellite markers and phenotypic data for 282 maize inbred lines belonging to
Pioneer Hi-bred International (Johnston, IA).
An alternate application of in silico mapping uses phenotypic data for maize hybrids,
generated in a breeding program, and genotypic data for parental inbred lines. This approach
detects QTLs in the parental inbred lines, with QTLs detected in each heterotic pattern
separately. This approach, as compared to an association mapping panel, has several advantages,
including: (1) The use of hybrid data can capture the heterotic phenotype of traits such as grain
yield and plant height, which cannot be achieved using phenotypic data of parental lines; (2) It
allows for detection of QTLs unique to a heterotic pattern, as well as cross-validation of QTLs
between heterotic patterns (Parisseaux and Bernardo 2004); and (3) Hybrid trials are conducted
in environments to which they are adapted, ensuring that detected QTLs are relevant to
environments used for crop production.
In silico association mapping using maize hybrid data from breeding programs has been
previously described. Parisseaux and Bernardo (2004) detected QTLs for grain moisture, plant
height and smut resistance using 96 SSR markers and 22,774 hybrids from the Limagrain
genetics program (France). The hybrids were generated from 1,266 inbred lines using nine
combinations of the nine heterotic groups. van Eeuwijk et al. (2010) detected QTLs for ear
height, plant height and yield using 769 SNPs and hybrid phenotypic data from Pioneer Hi-bred
International. The germplasm included 1,700 hybrids generated by crossing lines from two
heterotic groups. Both of these studies generated hybrids by crossing individuals from different
heterotic pools in various combinations, rather than systematically generating hybrids. The use of
a systematic mating design for in silico QTL discovery has not yet been explored.
North Carolina Design II (NCII) (Comstock and Robinson 1952) is a commonly used
mating design that is ideal for the integration of QTL mapping with existing maize hybrid data.
This mating design can be used to assess general and specific combining ability, and identify
superior inbreds and parental combinations. The NCII is a systematic mating design originally
developed for livestock, but it has been routinely used in maize as it nicely accommodates
heterotic patterns. In the NCII design, all female lines, belonging to one heterotic pattern, are
crossed to all male lines, belonging to a second heterotic pattern. The NCII structure partitions
the genotypic variance into effects due to the female, effects due to the male, and effects due to
the interaction of the male and female (Hallauer et al. 2010a). The genetic effects due to males
and females are equivalent to additive genetic effects, while the genetic effects due to the
interaction are equivalent to non-additive genetic effects (Rojas and Sprague 1952). This is an
efficient method to assess whether there are sufficient additive genetic effects in each heterotic
group to warrant a QTL mapping experiment with the data. The model described in this paper
aims to detect additive allele effects, and not dominance allele effects, because additive allele
effects generate predictable phenotypes, making them utilizable in a breeding program with
marker assisted selection.
This paper examines the potential of utilizing a structured mating design and GBS data
for in silico mapping. Inbred lines representing two heterotic patterns and possessing some
common ancestry were used to examine the potential of in silico QTL mapping. This study
utilized existing hybrid data for grain yield, at different plant densities, from an NCII consisting
of elite short season Stiff Stalk and Iodent inbred lines. While previous in silico mapping
studies have used pedigree information to establish the additive genetic relationships between
lines, this paper demonstrates the use of GBS data to estimate genomic relationships.
4.3 METHODS
4.3.1 Germplasm and Field Trials
Using a North Carolina Design II (Comstock and Robinson 1952), 110 hybrids were
created by crossing 11 Stiff Stalk inbred lines as females to 10 Iodent inbred lines (Table 4.1).
The inbred lines consisted of ex-PVP inbred lines and inbred lines developed at the
University of Guelph (see Table 4.1). The 110 hybrids were grown at three locations (Alma,
Elora and Waterloo, ON) for three years (2009-2011) at three plant population densities (37,000,
74,000, and 148,000 plants ha⁻¹) using a split-plot design with density as the main plot and
hybrids as sub-plots. Trials consisted of two replications per location and experimental units
were 2-row plots with 5.79 m rows, 0.76 m between rows, and 0.91 m between ranges. Data was
recorded for machine-harvestable grain yield, grain moisture, and bulk density (test weight). See
Holtrop (2016) for more details on the yield trials.
Pedigree records for the Stiff Stalk and Iodent heterotic groups were used to construct a
pedigree-based relationship matrix for each group. Additive relationship coefficients were
calculated from the pedigree records using the tabular method (Emik and Terrill 1949;
Henderson 1976). The diagonal elements were set to the default of 1, assuming parents of the
inbred line are unrelated.
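The tabular method can be illustrated with a minimal sketch. It assumes individuals are listed with parents before offspring, that founders have unknown (unrelated) parents, and that every diagonal element is fixed at 1 as described above. The two parents of the hybrid Pioneer 3803 and the intermediate F1 are given hypothetical names, and the inbreds selfed from the hybrid are treated as its direct full-sib progeny; these are simplifications made only for illustration.

```python
def tabular_relationships(pedigree):
    """Additive relationship coefficients by the tabular method.

    pedigree : list of (name, parent1, parent2), parents before offspring;
               None marks an unknown parent of a founder.
    Returns a dict of dicts A with A[x][y] the relationship of x and y.
    """
    names = [name for name, _, _ in pedigree]
    A = {name: {} for name in names}
    for j, (child, p1, p2) in enumerate(pedigree):
        for other in names[:j]:
            # Relationship with an earlier individual is the average of that
            # individual's relationships with the child's two parents.
            r1 = A[other][p1] if p1 is not None else 0.0
            r2 = A[other][p2] if p2 is not None else 0.0
            A[child][other] = A[other][child] = 0.5 * (r1 + r2)
        # Diagonal fixed at 1, assuming the parents are unrelated.
        A[child][child] = 1.0
    return A


pedigree = [
    ("P3803_A", None, None),   # hypothetical parent of Pioneer 3803
    ("P3803_B", None, None),   # hypothetical parent of Pioneer 3803
    ("CG33", None, None),
    ("CG102", None, None),
    ("CG37", "P3803_A", "P3803_B"),
    ("CG65", "P3803_A", "P3803_B"),
    ("F1", "CG65", "CG33"),    # hypothetical label for CG65/CG33
    ("CG118", "F1", "CG102"),
]
A = tabular_relationships(pedigree)
print(A["CG37"]["CG65"], A["CG118"]["CG102"], A["CG118"]["CG65"], A["CG118"]["CG37"])
```

Under these assumptions the coefficients for CG118 with CG102, CG65 and CG37 come out as 0.5, 0.25 and 0.125, the same values that appear in Table 4.4.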
Table 4.1. Pedigree of Stiff Stalk and Iodent inbred lines used in the NCII. Pedigree information
from E.A. Lee (Unpublished data) and United States plant variety protection certificates.
Inbred Female Heterotic Group Pedigree
CG37 Stiff Stalk Pioneer 3803
CG38 Stiff Stalk Pioneer 3803
CG57 Stiff Stalk Pioneer 3803
CG65 Stiff Stalk Pioneer 3803
CG79 Stiff Stalk Pioneer 3803
CG102 Stiff Stalk CG Stiff Stalk Combined C3
CG118 Stiff Stalk (CG65/CG33)/CG102
CG119 Stiff Stalk (CG65/CG33)/CG102
CG120 Stiff Stalk (CG65/CG33)/CG102
PHJ40 Stiff Stalk PHB09/PHB36
PHG71 Stiff Stalk/Iodent A632Ht/PH207
Inbred Male Heterotic Group Pedigree
PHG42 Iodent PH207/(PH207/PH806)
PHG29 Iodent PH207/(PH207/PH806)
PHG50 Iodent/Unrelated PH848/PH207
PHG72 Iodent PH814/PH207
PHG83 Iodent/Lancaster/Unrelated PH814/PH207
PH207 Iodent PH3BD2/PHG3RZ1
CG44 Iodent PHJ40/PHG72
CG60 Iodent PHJ40/PHR25
CG85 Iodent PHJ40/PHG72
CG108 Iodent PHJ40/PHR25
4.3.2 Molecular marker data
Genotyping-by-sequencing (GBS) data for the 21 inbred lines used in NCII was
generated by the Genomic Diversity Institute at Cornell, using the method of Elshire et al. (2011)
and Romay et al. (2013). To create the marker data for in silico mapping, the SNPs were filtered
for maf ≥ 0.05 and call rate of 100% (i.e., no missing data) using PLINK v1.9 (Purcell et al.
2007). SNPs were coded for the number of minor alleles (0,1,2) for the purpose of estimating
additive effects (Purcell et al. 2007). Ideograms, showing the distribution of the markers over the
chromosomes, were created using PhenoGram (Wolfe et al. 2013), with sizes of maize
chromosomes and centromere positions from MaizeGDB (Andorf et al. 2010).
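The filtering logic can be sketched in plain Python. The filtering in this study was done in PLINK; the sketch below only illustrates the two criteria and is not the PLINK implementation. It assumes genotypes are stored as counts of one allele (0, 1 or 2) with None for a missing call, and the function name and toy genotypes are hypothetical.

```python
def filter_snps(genotypes, min_maf=0.05, min_call_rate=1.0):
    """Keep SNPs that pass minor allele frequency and call rate thresholds.

    genotypes : dict mapping snp_id -> list of allele counts (0, 1, 2),
                with None for a missing call
    """
    kept = {}
    for snp, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        call_rate = len(called) / len(calls)
        if not called or call_rate < min_call_rate:
            continue
        freq = sum(called) / (2 * len(called))
        maf = min(freq, 1 - freq)  # frequency of the rarer allele
        if maf >= min_maf:
            kept[snp] = calls
    return kept


# Toy genotypes for four lines (hypothetical)
genotypes = {
    "snp1": [0, 0, 2, 2],
    "snp2": [0, 0, 0, 0],      # monomorphic, fails the maf filter
    "snp3": [0, 2, None, 2],   # one missing call
    "snp4": [0, 0, 0, 2],
}
print(sorted(filter_snps(genotypes)))
print(sorted(filter_snps(genotypes, min_call_rate=0.10)))
```

The second call relaxes the call rate threshold in the way described for the relationship matrix, so the SNP with one missing call is retained.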
The GBS data was also used to construct the genomic relationship matrix (G) for each
heterotic pattern. For the inbred lines in each heterotic pattern, SNPs were filtered for minor
allele frequency (maf) ≥ 0.05 and minimum call rate of 10%, with 173,490 SNPs passing
filtering criteria for Iodent lines and 159,180 SNPs for Stiff Stalk lines. For each heterotic
pattern, the G matrix, reflecting the allelic similarity, or identity-by-state (IBS), between the
inbred lines, was created using the method of VanRaden (2008). In order to facilitate
inversion of the G matrices, a value of 10 was added to diagonal elements. The genomic
relationship coefficients, as presented in the results, were calculated from coefficients in the G
matrix, using the following formula for individuals i and j:
$$\text{Relationship of } i, j = \frac{G_{ij}}{\sqrt{G_{jj}} \cdot \sqrt{G_{ii}}}$$
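As an illustration of the calculation, the following sketch builds a genomic relationship matrix using the first method of VanRaden (2008), G = ZZ' / (2 Σ p(1 - p)), where Z holds allele counts centred on twice the allele frequency, and then applies the scaling formula above. It assumes complete genotypes coded 0, 1 or 2 with at least one polymorphic SNP, and it omits the value added to the diagonal before inversion; the function names and the toy genotypes are hypothetical.

```python
import math


def vanraden_g(M):
    """Genomic relationship matrix from a lines-by-SNPs matrix of 0/1/2 counts."""
    n = len(M)
    m = len(M[0])
    # Allele frequency at each SNP.
    p = [sum(row[k] for row in M) / (2.0 * n) for k in range(m)]
    denom = 2.0 * sum(pk * (1.0 - pk) for pk in p)
    # Centre each SNP column on twice its allele frequency.
    Z = [[row[k] - 2.0 * p[k] for k in range(m)] for row in M]
    return [
        [sum(Z[i][k] * Z[j][k] for k in range(m)) / denom for j in range(n)]
        for i in range(n)
    ]


def relationship(G, i, j):
    """Scale G[i][j] by the square roots of the two diagonal elements."""
    return G[i][j] / (math.sqrt(G[j][j]) * math.sqrt(G[i][i]))


# Toy genotypes: three inbred lines, four SNPs (hypothetical)
M = [
    [0, 2, 2, 0],
    [0, 2, 0, 2],
    [2, 0, 0, 2],
]
G = vanraden_g(M)
for i in range(3):
    print([round(relationship(G, i, j), 3) for j in range(3)])
```

Because each SNP column is centred, the coefficients in each row of G sum to zero, which is why pairs of lines that are less similar than average receive negative values.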
4.3.3 Mixed Model Analysis
Broadsense heritability for grain yield, combined over years, planting densities and
replicates, was calculated by fitting the following model in ASReml 3.0 (Gilmour et al. 2009):
$$y = \mathit{ind} + e \qquad (1)$$
where $y$ is grain yield, $\mathit{ind}$ is the individual hybrids and $e$ is the residual error
($e \sim N(0, I\sigma_e^2)$). A pedigree file was used to connect the hybrids to their Stiff Stalk and
Iodent parents and contained limited pedigree information for the parents. Broadsense heritability
was estimated using the individual and error variances:
$$h^2 = \frac{\sigma_{ind}^2}{\sigma_{ind}^2 + \sigma_e^2}$$
QTLs were identified using a mixed model approach based on Parisseaux and Bernardo
(2004):
$$y = X_1\beta + X_2 n + X_2 d + (M_1\alpha_1 \ \text{or}\ M_2\alpha_2) + Z_1 g_1 + Z_2 g_2 + e \qquad (2)$$
where $y$ is the vector of phenotypic observations (grain yield); $\beta$ is a vector of fixed effects
(including overall mean $\mu$, density and replicate); $n$ is the environment (coded for combination of
year and location), $n \sim N(0, I\sigma_n^2)$; $d$ is the density by environment variable (coded for
combination of density, year and location), $d \sim N(0, I\sigma_d^2)$; $\alpha_1$ and $\alpha_2$ are vectors of
random additive genetic effects associated with markers in heterotic groups 1 and 2,
$\alpha_1 \sim N(0, I\sigma_{\alpha_1}^2)$ and $\alpha_2 \sim N(0, I\sigma_{\alpha_2}^2)$; $g_1$ and $g_2$ represent random polygenic
effects of heterotic groups 1 and 2, $g_1 \sim N(0, G_1\sigma_{g_1}^2)$ and $g_2 \sim N(0, G_2\sigma_{g_2}^2)$; and $e$ is the
residual error, $e \sim N(0, I\sigma_e^2)$. $I$ is an identity matrix, and $X_1$, $X_2$, $M_1$, $M_2$, $Z_1$ and $Z_2$
are incidence matrices of 1s and 0s relating $y$ to $\beta$, $n$ and $d$, $\alpha_1$, $\alpha_2$, $g_1$ and $g_2$,
respectively. Given the genetic structure of the hybrids and to
maximize the number of markers used for association mapping, each heterotic group was
analyzed separately, with either $M_1\alpha_1$ or $M_2\alpha_2$ in the model. This model differs from Parisseaux
and Bernardo (2004) in the treatment of markers as random effects as well as including
environment and density by environment variables in the model as random effects. Previous
studies have treated environmental effects as fixed, but here they are assumed to be random
because environment is considered a random effect in NCII ANOVA analysis. The model for no
QTL is identical to equation (2) except that marker terms are removed:
$$y = X_1\beta + X_2 n + X_2 d + Z_1 g_1 + Z_2 g_2 + e \qquad (3)$$
Log likelihood (LogL) estimates for the models were computed using ASReml 3.0, using a
restricted maximum likelihood (REML) approach (Gilmour et al. 2009). The likelihood ratio test
(LRT) statistic was computed using:
$$LRT = -2(\log L_2 - \log L_1)$$
P-values for the test statistics at each SNP were calculated using a $\chi^2$ distribution (1 df).
P-values were adjusted for multiple testing using positive false discovery rate (FDR) using PROC
MULTTEST (SAS version 9.4, SAS Inst. Inc., Cary, NC). The predicted values of the observed
phenotype ($\hat{y}$) from model (3) were considered the adjusted phenotype. The adjusted phenotype
was modelled with all significant SNPs fit simultaneously as fixed effects using PROC GLM.
The R-squared for this model was used as an approximation for the phenotypic variance
explained by all the significant SNPs. The preceding mixed model analysis was also conducted
using pedigree-based relationship matrices in place of the G matrices.
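The testing step can be sketched as follows. The sketch assumes that the second log likelihood in the LRT formula belongs to the no-QTL model (3) and the first to the marker model (2), which the text does not state explicitly. The P-value uses the identity that the upper tail of a chi-square distribution with 1 df equals erfc(sqrt(x/2)). For the multiple-testing step, the Benjamini-Hochberg adjustment is swapped in as a simpler, related procedure; it is not the positive FDR computed by PROC MULTTEST. All function names and the numeric inputs are hypothetical.

```python
import math


def lrt_statistic(logl_no_qtl, logl_marker):
    """Likelihood ratio test statistic, -2(logL_no_qtl - logL_marker)."""
    return -2.0 * (logl_no_qtl - logl_marker)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log likelihoods for three SNPs against one no-QTL model
logl_no_qtl = -1250.0
logl_marker = [-1242.0, -1249.5, -1246.0]
pvals = [chi2_sf_1df(lrt_statistic(logl_no_qtl, l)) for l in logl_marker]
print(pvals)
print(benjamini_hochberg(pvals))
```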
4.4 RESULTS AND DISCUSSION
Complete analysis of variance (ANOVA) of grain yield was done previously (Holtrop
2016). Briefly, density, genotype, environment and all the interactions were significant sources
of variation. General productivity (i.e., grain yield across all environments and densities) was
controlled by mostly non-additive genetic effects. Density tolerance and yield potential,
however, were due to mostly additive genetic variation found exclusively among the Stiff Stalk
inbred lines. The Iodent inbred lines used in the study did not exhibit significant additive genetic
variation for grain yield when grown at different plant densities. The broadsense heritability (H²)
estimate for grain yield was H² = 0.23, which is consistent with H² estimates in other studies:
0.07 to 0.37 (Badu-Apraku 2010), 0.19 (Silva et al. 2013), and 0.13 to 0.24 (Hallauer et al.
2010b). Based on the NCII ANOVA analysis, the Stiff Stalk inbred lines exhibited a significant
density by additive genetic variation interaction, while the Iodent lines did not exhibit significant
additive genetic variation for either general productivity or in response to changes in plant
density. Given these initial observations, it was hypothesized that QTLs would only be detected
within the Stiff Stalk inbred lines.
4.4.1. Genomic relationship matrices
For each heterotic group, genomic relationship matrices were constructed to describe the
variance-covariance structure of the polygenic terms in the model. The G matrices for Stiff Stalk
and Iodent lines are presented in Tables 4.2 and 4.3. The off-diagonal elements reflect the
similarity between the pair of lines, based on the shared alleles between the lines, as compared to
the allelic frequency in the population. Large positive coefficients indicate high similarity
between that pair of lines. Negative coefficients can be interpreted as negative correlations. A
negative coefficient implies that the two individuals are more dissimilar than the average pair of
individuals in the dataset. Diagonal elements in the G matrix represent the degree of inbreeding,
which reflects the probability that two genes in the individual are identical by descent. In the
tabular method of pedigree-based relationships, a diagonal value of 1 is used if parents are
unrelated. Larger coefficients indicate high homozygosity in the individual.
To contrast the genomic relationship values with traditionally used pedigree-based
values, the pedigree records were used to construct pedigree-based relationships for Stiff Stalk
(Table 4.4) and Iodent (Table 4.5). The pedigree records of the germplasm in NCII indicate high
degrees of relatedness between many of the lines. Comparison of pedigree-based relationship
values with the G matrix shows that these genomic relationship values are consistently lower
than values obtained from pedigree records. Five of the Stiff Stalk lines are full siblings, derived
from the same hybrid, with a relationship coefficient of 0.5. The genomic relationship values
between the five full siblings ranges from -0.05 to 0.22. A set of three Stiff Stalk siblings
(CG118, CG119, CG120) are derived from a three-way cross, in which one parent, CG102, and
one grandparent, CG65, are also used in the NCII, with relationship coefficients of 0.5 and 0.25,
respectively. The genomic values between the three siblings ranges from 0.02 to 0.30. The
genomic relationship of the siblings with CG102 ranges from 0.22 to 0.35, while their
relationship with CG65 ranges from -0.31 to -0.09. For the Iodent inbreds, there are two sets of
full siblings: CG44 and CG85, and CG60 and CG108. Full sibling pedigree relationship values
are 0.5. The genomic relationship between CG44 and CG85 is 0.14, and the genomic relationship
between CG60 and CG108 is 0.13 (Table 4.3). The two sets of siblings both have PHJ40 as a parent, making
them half-siblings, and also both have PH207 as a grandparent on the other side of the pedigree,
resulting in a pedigree-based relationship coefficient of 0.31 between the sets. The genomic
relationships between siblings from different sets range from 0 to 0.09.
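The pedigree-based coefficients quoted above (0.5 for full siblings and for parent and progeny, 0.25 for grandparent and grandprogeny) follow from the tabular method. The sketch below is an illustration only: the line names are hypothetical, founders are assumed to be unrelated, and, as in Tables 4.4 and 4.5, founders receive a diagonal value of 1.

```python
def pedigree_relationship_matrix(pedigree):
    """Tabular method for the additive relationship matrix.

    pedigree: list of (line, parent1, parent2) tuples, with parents listed
    before their progeny and None for an unknown parent.
    """
    index = {line: i for i, (line, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    for i, (_, p1, p2) in enumerate(pedigree):
        parents = [index[p] for p in (p1, p2) if p is not None]
        for j in range(i):
            # Half the sum of line j's relationships with the parents of i;
            # an unknown parent contributes zero.
            a[i][j] = a[j][i] = 0.5 * sum(a[j][p] for p in parents)
        # Diagonal = 1 + inbreeding coefficient (half the parents' relationship).
        inbreeding = 0.5 * a[parents[0]][parents[1]] if len(parents) == 2 else 0.0
        a[i][i] = 1.0 + inbreeding
    return a


# Hypothetical three-way cross: (GP1 x GP2) x P gives siblings S1 and S2.
pedigree = [
    ("GP1", None, None),
    ("GP2", None, None),
    ("P", None, None),
    ("X", "GP1", "GP2"),
    ("S1", "X", "P"),
    ("S2", "X", "P"),
]
A = pedigree_relationship_matrix(pedigree)
names = [line for line, _, _ in pedigree]
print("S1-S2:", A[names.index("S1")][names.index("S2")])
print("S1-P:", A[names.index("S1")][names.index("P")])
print("S1-GP1:", A[names.index("S1")][names.index("GP1")])
```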
Table 4.2. GBS-based relationship matrix for the Stiff Stalk inbred lines.
          CG37    CG38    CG57    CG79    CG65   CG102   CG118   CG119   CG120   PHG71   PHJ40
CG37     1.004
CG38     0.104   0.754
CG57     0.018   0.163   1.599
CG79     0.223   0.050   0.089   0.925
CG65     0.142  -0.026  -0.049   0.104   1.577
CG102   -0.300  -0.244  -0.342  -0.286  -0.386   1.662
CG118   -0.254  -0.235  -0.241  -0.229  -0.240   0.222   1.635
CG119   -0.299  -0.224  -0.270  -0.277  -0.309   0.349   0.298   1.181
CG120   -0.201  -0.236  -0.294  -0.224  -0.091   0.256   0.019   0.061   1.594
PHG71   -0.449  -0.448  -0.235  -0.344   0.022   0.097   0.215   0.156   0.003   1.534
PHJ40   -0.060  -0.217  -0.083  -0.193  -0.117  -0.085  -0.175  -0.221  -0.103  -0.218   2.901
Table 4.3. GBS-based relationship matrix for the Iodent inbred lines.
          CG44    CG85    CG60   CG108   PH207   PHG72   PHG50   PHG83   PHG29   PHK42
CG44     0.812
CG85     0.135   2.259
CG60    -0.002   0.049   1.115
CG108    0.029   0.088   0.134   0.987
PH207   -0.256  -0.472  -0.278  -0.227   0.927
PHG72    0.045  -0.075  -0.216  -0.229   0.088   1.332
PHG50   -0.090  -0.041   0.012  -0.075  -0.389  -0.238   2.878
PHG83   -0.192  -0.214  -0.143  -0.128  -0.011  -0.232  -0.121   1.800
PHG29   -0.275  -0.459  -0.284  -0.260   0.584  -0.002  -0.348   0.014   1.025
PHK42   -0.237  -0.384  -0.272  -0.238   0.459   0.006  -0.313  -0.095   0.477   1.084
Table 4.4. Pedigree-based relationship matrix for the Stiff Stalk inbred lines.
          CG37    CG38    CG57    CG79    CG65   CG102   CG118   CG119   CG120   PHG71   PHJ40
CG37         1
CG38       0.5       1
CG57       0.5     0.5       1
CG79       0.5     0.5     0.5       1
CG65       0.5     0.5     0.5     0.5       1
CG102        0       0       0       0       0       1
CG118    0.125   0.125   0.125   0.125    0.25     0.5       1
CG119    0.125   0.125   0.125   0.125    0.25     0.5     0.5       1
CG120    0.125   0.125   0.125   0.125    0.25     0.5     0.5     0.5       1
PHG71        0       0       0       0       0       0       0       0       0       1
PHJ40        0       0       0       0       0       0       0       0       0       0       1
Table 4.5. Pedigree-based relationship matrix for the Iodent inbred lines.
          CG44    CG85    CG60   CG108   PH207   PHG72   PHG50   PHG83   PHG29   PHK42
CG44         1
CG85       0.5       1
CG60      0.31    0.31       1
CG108     0.31    0.31     0.5       1
PH207     0.25    0.25    0.25    0.25       1
PHG72      0.5     0.5   0.125   0.125     0.5       1
PHG50    0.125   0.125   0.125   0.125     0.5    0.25       1
PHG83    0.125   0.125   0.125   0.125     0.5    0.25    0.25       1
PHG29     0.19    0.19    0.19    0.19    0.75    0.38    0.38    0.38       1
PHK42     0.19    0.19    0.19    0.19    0.75    0.38    0.38    0.38     0.5       1
The pedigree-based and genomic relationship coefficients presented here differ due to the
base populations used to estimate population allele frequencies. The pedigree-based method
assumes a base population of all the maize lines, with population allele frequencies reflective of
the entire maize population. In contrast, the genomic relationship method uses the lines in NCII
as the base population. For both the Stiff Stalk and Iodent groups, the “populations” are very
small and consist of highly related lines, which are not reflective of the maize population as a
whole. The calculation of the genomic relationships compares the alleles shared between a pair
of lines to the frequency of the alleles in the population. Calculating relationships using these
very small populations, consisting of one heterotic pattern only, results in values that are smaller
than pedigree-based estimates, close to zero, or even negative.
To demonstrate the influence of the base population on genomic relationship values, the
genomic relationships between inbred lines were calculated using a large, diverse base
population, similar to what is assumed using pedigree estimates. Several of these inbred lines
used in the NCII were included in a genomic relationship analysis of nearly 1,200 maize lines
including public and expired proprietary germplasm from Stiff Stalk, Iodent and non-Stiff Stalk
heterotic patterns (see Chapter 3). In this analysis, the genomic relationship of CG65 with CG57
(expected 0.5) was 0.34, CG65 with CG118 and CG120 (expected 0.25) were 0.18 and 0.29
respectively, CG102 with CG118 and CG120 (expected 0.5) were 0.44 and 0.47 respectively,
and CG118 with CG120 (expected 0.5) was 0.32. These examples show that calculating the
genomic relationships using a large, diverse base population results in values that are comparable
to pedigree-based estimates. The genomic relationship matrix used in this research was derived
using the small population, so its coefficients reflect the similarity between these lines
relative to the small population analyzed.
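The effect of the base population can be reproduced with a small numerical sketch. It again assumes the VanRaden (2008) formulation with allele frequencies taken from whichever set of lines is analysed, and all genotypes are hypothetical. Lines A and B share five of six marker genotypes, yet their coefficient is zero when only three closely related lines form the base population, and clearly positive once three unrelated lines are added.

```python
def g_matrix(genotypes):
    """G = ZZ' / (2 * sum(p * (1 - p))), frequencies from the lines supplied."""
    n_lines = len(genotypes)
    n_markers = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n_lines)
             for j in range(n_markers)]
    z = [[row[j] - 2.0 * freqs[j] for j in range(n_markers)]
         for row in genotypes]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    return [[sum(a * b for a, b in zip(z[i], z[k])) / scale
             for k in range(n_lines)]
            for i in range(n_lines)]


def relationship(lines, first, second):
    """Coefficient between two named lines, using `lines` as the base population."""
    names = list(lines)
    g = g_matrix([lines[name] for name in names])
    return g[names.index(first)][names.index(second)]


# Hypothetical minor-allele counts at six markers.
related = {
    "A": [2, 2, 2, 2, 0, 0],
    "B": [2, 2, 2, 2, 2, 0],
    "C": [2, 2, 2, 0, 0, 2],
}
diverse = {
    "D": [0, 0, 0, 0, 2, 2],
    "E": [0, 0, 0, 2, 0, 2],
    "F": [0, 0, 0, 0, 2, 0],
}

small_base = relationship(related, "A", "B")
large_base = relationship({**related, **diverse}, "A", "B")
print("A-B, small related base population: %.2f" % small_base)
print("A-B, larger diverse base population: %.2f" % large_base)
```

In the small set, the three markers at which A, B and C are all identical are monomorphic and carry no information, so only the markers at which the related lines differ contribute to the coefficient.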
4.4.2 QTL detection using genomic and pedigree-based matrices
This mapping experiment was conducted twice, using genomic and then pedigree-based
matrices to describe the polygenic variance-covariance structure in the model, representing the
relationships between individuals. For the Iodent germplasm, neither approach detected any
significant SNP associations with grain yield at different planting densities. Significant
associations were detected in the Stiff Stalk germplasm using both approaches, although the
number of significant loci differed (Figure 4.1). The scatter plots of LRT values (Figure 4.1)
show “levels” of markers with the same LRT value; the markers within a level share the same
pattern of alleles across the inbred lines (Table 4.6). While the LRT levels, and
the markers within each level, are identical between the two approaches, the number of
significant markers differs. The G matrix approach identified 123 significant associations, while
the pedigree-based relationship matrices generated an additional 224 significant associations
(Table 4.7). The SNP loci significantly associated with grain yield at different planting densities
together explain approximately 9.38% of the phenotypic variance (adjusted for environmental
effects), using the results of either the G or pedigree-based matrices.
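The correspondence between LRT levels and allele patterns can be sketched as follows. Markers with identical genotypes across all lines enter the model as identical covariates, so they necessarily receive the same test statistic. The allele patterns are those reported for levels 1 to 4 in Table 4.6; the marker names are hypothetical placeholders.

```python
from collections import defaultdict

LINES = ["CG102", "CG118", "CG119", "CG120", "CG37", "CG38",
         "CG57", "CG65", "CG79", "PHG71", "PHJ40"]

# Minor-allele counts per line, in the order of LINES (patterns from Table 4.6).
# Marker names are hypothetical.
markers = {
    "snp_a": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 1),  # level 1 pattern
    "snp_b": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2),  # level 2 pattern
    "snp_c": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2),  # level 2 pattern
    "snp_d": (0, 0, 0, 0, 0, 2, 2, 0, 0, 1, 2),  # level 3 pattern
    "snp_e": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 0),  # level 4 pattern
}


def group_by_pattern(marker_genotypes):
    """Group marker names that share an identical genotype pattern across lines."""
    groups = defaultdict(list)
    for name, pattern in marker_genotypes.items():
        groups[pattern].append(name)
    return dict(groups)


groups = group_by_pattern(markers)
for pattern, names_in_group in groups.items():
    print(len(names_in_group), "marker(s) share pattern", pattern)
```

The four patterns differ only in the genotypes of PHG71 and PHJ40, which is consistent with the small number of distinct LRT values observed.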
Figure 4.1. Scatter plot of Likelihood Ratio Test statistic (LRT) values for Stiff Stalk markers for
additive effects for grain yield at different planting densities using the G matrix (top) and
pedigree matrix (bottom). Lines indicate FDR adjusted q-value < 0.05.
Table 4.6. The levels of LRT values observed in the scatter plot (Figure 4.1) correspond to
different patterns of alleles across the inbred lines. Level 1 indicates the highest LRT value. The
numbers for the alleles indicate the count of minor alleles (i.e. 0 = homozygous major allele, 1 =
heterozygous, 2 = homozygous minor allele).
              Alleles
Inbred line   Level 1   Level 2   Level 3   Level 4
CG102         0         0         0         0
CG118         0         0         0         0
CG119         0         0         0         0
CG120         0         0         0         0
CG37          0         0         0         0
CG38          2         2         2         2
CG57          2         2         2         2
CG65          0         0         0         0
CG79          0         0         0         0
PHG71         2         2         1         2
PHJ40         1         2         2         0
Table 4.7. Mapping with the pedigree-based matrix results in more markers with significant
associations than mapping with the genomic relationship matrix. Likelihood ratio test (LRT)
statistics and corresponding adjusted p-values (q-values) for SNPs in the different LRT levels are
shown. Q-values < 0.05 are indicated with a *.
                             Genomic matrix       Pedigree-based matrix
LRT level   Number of SNPs   LRT     q-value      LRT     q-value
1           3                24.2    0.0012 *     24.98   0.0005 *
2           118              20.78   0.0012 *     22.5    0.0005 *
3           2                13.68   0.0478 *     15      0.0169 *
4           224              11.56   0.0541       13.7    0.0169 *
5           1                9.26    0.1857       11.44   0.0563
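The conversion from an LRT statistic to an FDR-adjusted q-value can be sketched as below. Two assumptions are made that are not stated in this section: the LRT is referred to a chi-square distribution with one degree of freedom, and the FDR adjustment is the Benjamini-Hochberg procedure. The q-values in Table 4.7 depend on the p-values of all tested SNPs and so cannot be reproduced from the table alone; the p-values below are hypothetical.

```python
import math


def lrt_p_value(lrt):
    """Upper-tail probability of a chi-square variable with 1 df (assumed)."""
    return math.erfc(math.sqrt(lrt / 2.0))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for five markers.
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
example_q = benjamini_hochberg(example_p)
print(["%.5f" % q for q in example_q])
print("p-value for LRT = 3.84 (1 df): %.3f" % lrt_p_value(3.84))
```

Because the adjustment depends on the rank of each p-value among all tests, a small shift in LRT values between the two relationship matrices can move an entire level of markers across the 0.05 threshold, as seen for level 4.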
The observation of markers occupying distinct levels of LRT values in the scatter plots
(Figure 4.1), as opposed to markers being spread over many LRT values, is likely due to the
population structure of the germplasm, in which there are limited combinations of alleles
between individuals. These limited combinations of alleles may be due to the small sample size,
inbreeding and high level of relatedness of the lines used. The genotypes of markers with
significant associations were cross-referenced with the grain yield performance of each inbred
line at the three planting densities (Table 4.8). For low density (37 k ha⁻¹), there appeared to be
no effect of genotype on grain yield. However, at commercial and high density (74 k ha⁻¹ and
148 k ha⁻¹, respectively), the lines homozygous for the major allele had positive gᵢ values (with
the exception of CG79 at 148 k ha⁻¹), while lines that are heterozygous or homozygous for the
minor allele had negative gᵢ values. This pattern suggests that homozygosity for the major allele,
at these detected loci, confers favourable density tolerance at commercial and high densities.
Table 4.8. For markers in the top three LRT levels, homozygosity for the major allele is
associated with high grain yield at commercial (74 k ha⁻¹) and high (148 k ha⁻¹) population
densities. The inbred lines are ordered according to the number of minor alleles at loci in the top
3 LRT levels (see Table 4.6). Estimates of grain yield, expressed as gᵢ, are shown for each inbred
line for each planting density (Holtrop 2016). These estimates reflect the difference between the
average yield of all progeny of a parental line and the average yield of all hybrids grown in an
environment.
                                              gᵢ (Mg ha⁻¹)
Inbred line          Allele in top            37,000 ha⁻¹   74,000 ha⁻¹   148,000 ha⁻¹
                     3 LRT levels
CG102                0                        -0.41         0.59          0.79
CG118                0                        0.33          0.53          0.31
CG119                0                        -0.02         0.32          0.52
CG120                0                        -0.05         0.31          0.80
CG37                 0                        -0.10         0.11          0.56
CG65                 0                        0.36          0.07          0.26
CG79                 0                        0.37          0.23          -0.07
PHJ40                2/1                      -0.17         -0.17         -0.51
PHG71                2/1                      -0.07         -0.48         -0.90
CG38                 2                        -0.18         -0.62         -0.72
CG57                 2                        -0.06         -0.90         -1.03
Grand Mean                                    7.99          9.95          9.04
LSD gᵢ (0.05)                                 0.24          0.24          0.24
LSD gᵢ - gⱼ (0.05)                            0.35          0.35          0.35
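The cross-referencing of genotype with grain yield performance can be summarized numerically. The sketch below uses the values in Table 4.8 and simply averages the g_i estimates within two classes, lines homozygous for the major allele and lines carrying the minor allele, at each density. The two-class grouping and variable names are illustrative choices, not part of the original analysis.

```python
DENSITIES = ("37,000", "74,000", "148,000")

# (carries minor allele at the top-3-level loci, g_i at each density), from Table 4.8.
TABLE_4_8 = {
    "CG102": (False, (-0.41, 0.59, 0.79)),
    "CG118": (False, (0.33, 0.53, 0.31)),
    "CG119": (False, (-0.02, 0.32, 0.52)),
    "CG120": (False, (-0.05, 0.31, 0.80)),
    "CG37": (False, (-0.10, 0.11, 0.56)),
    "CG65": (False, (0.36, 0.07, 0.26)),
    "CG79": (False, (0.37, 0.23, -0.07)),
    "PHJ40": (True, (-0.17, -0.17, -0.51)),
    "PHG71": (True, (-0.07, -0.48, -0.90)),
    "CG38": (True, (-0.18, -0.62, -0.72)),
    "CG57": (True, (-0.06, -0.90, -1.03)),
}


def class_means(table, carries_minor):
    """Mean g_i at each density for lines in the requested genotype class."""
    rows = [g for flag, g in table.values() if flag == carries_minor]
    return tuple(sum(row[d] for row in rows) / len(rows)
                 for d in range(len(DENSITIES)))


major_means = class_means(TABLE_4_8, carries_minor=False)
minor_means = class_means(TABLE_4_8, carries_minor=True)
for density, major, minor in zip(DENSITIES, major_means, minor_means):
    print("%s plants/ha: major-allele lines %.2f, minor-allele carriers %.2f"
          % (density, major, minor))
```

The difference between the two class means is small at the lowest density and increases at the commercial and high densities, in line with the pattern described in the text.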
In this study, the G and pedigree-based approaches for generating the relationship matrix
produced very similar QTL mapping results. The two approaches generated slightly different
LRT values, with this difference resulting in different q-values for the markers (Table 4.7). The
G matrix approach was more stringent, with a smaller number of significantly associated loci
detected. The LogL values generated by ASReml, describing the fit of the model to the data,
were nearly identical between the two approaches, with the G matrix (LogL -4302.30) having a
slightly better fit than the pedigree-based relationship matrix (LogL -4302.48). With the increasing availability
of genetic data, the use of markers to generate a relationship matrix provides an alternative to the
traditional approach of using pedigree records. This research suggests that in cases where
pedigree records are limited or unknown, the G matrix is a suitable replacement, generating very
similar results to the pedigree-based matrix.
It has been proposed that deriving the relationship matrix from genetic data actually
produces more accurate values than using pedigree-based estimates, because marker data can
account for Mendelian sampling, giving more precise estimates than pedigree records (Hayes et
al. 2009; Pryce et al. 2012), and because pedigree records commonly contain errors, generating
inaccurate relationship values (VanRaden 2008). In the present study, the use of a G matrix
conferred little advantage over the pedigree-based matrix in terms of the fit of the model but did
generate a more stringent output of significant SNP associations. Despite the differences in the
relationship coefficients produced from the two methods, the use of a G matrix over the
pedigree-based matrix did not drastically alter the results, which may be because the marker
alleles explained a larger amount of the phenotypic variance than either relationship matrix. The
relative importance of the relationship matrix in explaining the phenotypic variance, compared
to other terms in the model, may be affected by the germplasm used or the phenotypic trait
analyzed. A direction for future research is to compare the mapping results of G and pedigree-
based matrices for an NCII using different germplasm or a different phenotypic trait. The
following sections of this paper focus on the results from the G matrix analysis.
4.4.3 Informative SNPs used for in silico mapping
In this mixed model approach, the heterotic groups were treated separately for testing
marker associations. Filtering resulted in 27,398 SNPs for Stiff Stalk and 32,401 for Iodent used
for testing marker associations, which are distributed across all the maize chromosomes (Figure
4.2). Of the SNPs passing filtering in each heterotic group, only 11,383 SNPs were common to
both groups. The large number of markers unique to each group may be due to the
presence/absence variation in the maize genome, in which some genomic regions may be present
only in one heterotic group, or due to uncalled (missing) GBS data in one of the groups.
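The marker counts reported above imply the number of SNPs unique to each heterotic group. The short sketch below works through that arithmetic and shows the same operation on sets of marker identifiers; the identifiers are hypothetical.

```python
# Counts reported in the text.
stiff_stalk_total = 27398
iodent_total = 32401
shared = 11383

unique_stiff_stalk = stiff_stalk_total - shared
unique_iodent = iodent_total - shared
total_distinct = stiff_stalk_total + iodent_total - shared
print("Unique to Stiff Stalk:", unique_stiff_stalk)
print("Unique to Iodent:", unique_iodent)
print("Distinct SNPs tested across both groups:", total_distinct)

# The same operation on hypothetical marker identifiers.
stiff_stalk_snps = {"S1_1001", "S1_2050", "S3_777", "S10_42"}
iodent_snps = {"S1_2050", "S3_777", "S5_900"}
common_snps = stiff_stalk_snps & iodent_snps   # usable for additive by additive tests
only_stiff_stalk = stiff_stalk_snps - iodent_snps
only_iodent = iodent_snps - stiff_stalk_snps
print(sorted(common_snps), sorted(only_stiff_stalk), sorted(only_iodent))
```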
Testing marker associations in the groups separately allowed for the testing of many
more associations than if only shared SNPs were tested. The drawback of using SNPs unique to
one heterotic pattern is that additive by additive effects cannot be modelled and tested. In this
study, the 11,383 SNPs were used to test for additive by additive effects, but no significant
associations were detected. It is proposed that the SNPs that are unique to a heterotic pattern are
in fact the most useful, rather than the ones that are shared. There may be certain traits or QTLs
that have been selected for in the different heterotic groups independently, since these groups of
germplasm have been kept genetically distinct in proprietary breeding programs.
Figure 4.2. Ideogram illustrating the genome coverage of the SNP markers used for in silico
mapping. Black bands show the position of (A) 27,398 SNPs in Stiff Stalk inbred lines and (B)
32,401 SNPs in Iodent inbred lines.
4.4.4 QTL detection and NCII
The outcome of the NCII ANOVA (Holtrop 2016) indicated that Stiff Stalk inbred lines
exhibited a significant density by additive genetic variation interaction but did not have
significant additive effects for general productivity. There was no significant additive genetic
variation in Iodent lines for general productivity or for density by additive genetic variation
interaction. Consistent with the ANOVA results, this QTL mapping study detected significant
associations for grain yield at different planting densities in Stiff Stalk inbreds, but did not detect
any significant associations in Iodent lines.
This study demonstrates the use of NCII for QTL mapping, which allows mapping results
to be cross-referenced with the results of NCII ANOVA. Significant associations were detected
in the Stiff Stalk background using the mixed model approach, but not in the Iodent background,
despite a large number of SNPs used to test marker associations (Figure 4.2B). While it is not
unusual for a QTL to be detected in one heterotic pattern only, the absence of detectable QTLs in
a heterotic pattern has not been reported in previous studies (Parisseaux and Bernardo 2004; Van
Eeuwijk et al. 2010). It appears that the Iodent germplasm was not optimal for QTL mapping for
grain yield, considering the lack of significant additive genetic variation reported in the NCII
ANOVA. Additionally, the average performance of half of the Iodent lines was not significantly
different than the overall average for grain yield (Holtrop 2016), which greatly reduces the
opportunity to identify loci related to above and below average yields. By using NCII germplasm
for a QTL mapping experiment, the NCII ANOVA can be used as a screening tool to determine
if there is sufficient additive genetic variation in the germplasm to warrant QTL mapping.
The results of the NCII ANOVA indicated that the only source of additive genetic variation
for grain yield was the Stiff Stalk by density interaction. Given this, the detected QTLs are not
QTLs for general productivity but instead are QTLs for the interaction of planting density and
grain yield. In this study, the detected grain yield QTLs in Stiff Stalk germplasm are more
accurately described as QTLs for additive genetic effects for the impact of planting density on
grain yield. In this way, the use of NCII ANOVA results to detect the source of the additive
genetic variation in the germplasm facilitates a greater understanding of QTLs detected in the
QTL mapping experiment.
4.4.5 QTLs detected in Stiff Stalk
The mixed model approach with the G matrix identified 123 SNPs in Stiff Stalk
associated with additive effects for grain yield that together explain approximately 9.38% of the
phenotypic variance, adjusted for environmental effects. These SNPs are located on five
chromosomes (Figure 4.3), with the positions of the significantly associated SNPs listed in
Supplemental Table 1. The SNPs with significant associations are located in bins 1.04, 1.10,
3.02, 3.04, 3.05, 3.08, 5.03, 9.02, 10.02, 10.03, 10.04 and 10.05, using bins reported by MaizeGDB
(Andorf et al. 2010).
Figure 4.3. Ideogram showing the chromosomal locations of the SNPs that were significantly
associated with regions influencing grain yield in the Stiff Stalk inbred lines.
It is expected that yield, a quantitative trait, would be controlled by many QTLs, each
with small effects. When mapping with elite germplasm, large effect QTL are unlikely to be
segregating in the population, since major favourable alleles are likely already fixed. Considering
the close genetic relationships among the Stiff Stalk lines, there are likely large linkage
disequilibrium (LD) blocks within the population. Detected SNPs that are found in close
proximity could therefore be referred to more accurately as detected LD regions, assuming that
the detected SNPs within an LD block are linked to the underlying QTL(s) within the LD block.
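One simple way to collapse neighbouring significant SNPs into candidate regions is to merge SNPs on the same chromosome that lie within a chosen physical distance of each other. The sketch below is illustrative only: physical distance is a crude proxy for LD, and the positions and the 500 kb threshold are hypothetical rather than values from this study.

```python
def group_into_regions(snps, max_gap):
    """Merge SNPs into regions when consecutive positions are within max_gap bp.

    snps: iterable of (chromosome, position_bp) tuples.
    Returns a list of (chromosome, start_bp, end_bp, snp_count) tuples.
    """
    regions = []
    for chrom, pos in sorted(snps):
        if regions and regions[-1][0] == chrom and pos - regions[-1][2] <= max_gap:
            # Extend the current region to include this SNP.
            c, start, _, count = regions[-1]
            regions[-1] = (c, start, pos, count + 1)
        else:
            # Start a new region on a new chromosome or after a large gap.
            regions.append((chrom, pos, pos, 1))
    return regions


# Hypothetical significant SNP positions.
hypothetical_snps = [
    (1, 1_200_000), (1, 1_250_000), (1, 9_000_000),
    (3, 400_000), (3, 450_000), (3, 480_000),
]
regions = group_into_regions(hypothetical_snps, max_gap=500_000)
for region in regions:
    print(region)
```

A distance-based grouping of this kind does not replace haplotype construction, which would delimit the LD blocks directly.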
The detected SNPs, likely reflecting LD regions, are located in bins on chromosomes 1
(1.04, 1.10), 3 (3.02, 3.04, 3.05, 3.08), 5 (5.03), 9 (9.02) and 10 (10.02, 10.03, 10.04, 10.05).
Maize grain yield QTLs have been previously reported in bins 1.04 (Austin and Lee 1996;
Messmer et al. 2009), 1.10 (Ribaut et al. 1997; Melchinger et al. 1998), 5.03 (Beavis et al. 1994;
Melchinger et al. 1998; Nikolić et al. 2012), 10.02 (Kozumplik et al. 1996), 10.03 (Stuber et al.
1992; Ribaut et al. 1997), and 10.04 (Stuber et al. 1992; Ajmone-Marsan et al. 1995;
Ajmone-Marsan et al. 1996; Melchinger et al. 1998), according to MaizeGDB (Andorf et al. 2010). This
research appears to be the first to report QTLs for grain yield in bins 9.02, 10.05 and the bins
detected on chromosome 3. Validation of QTLs through comparison with other studies can be
difficult due to differences in germplasm and markers used.
These SNPs with significant associations are represented by 56 gene models, eight of which
have putative functions, according to MaizeGDB (Andorf et al. 2010). These gene models
include six transcription factors (outer cell layer1, c2c2-gata TF31, c2h2 TF235, NAC TF67,
MYB TF134, bHLH TF153), ATPase1 and glutathione transporter1 (Supplemental Table 1).
These genes function in plant development as well as in response to biotic and abiotic stress.
While these SNPs only have a statistical association with grain yield, and not a biological one,
transcription factors involved in abiotic and biotic stress response are likely candidates to
influence grain yield.
4.4.6 Discussion of the mixed model approach
This paper demonstrated association mapping for the complex trait of grain yield using a
small sample size of 110 lines. Previous in silico mapping studies have used larger sample sizes
of: 404 inbred lines (Zhang et al. 2005), 1,700 hybrids (van Eeuwijk et al. 2010) and 22,774
hybrids (Parisseaux and Bernardo 2004). This experiment was expected to have lower power to
detect QTLs than previous studies due to a smaller sample size and a large number of QTL
underlying the trait (Yu et al. 2005). However, the use of high marker coverage is able to
increase the power of QTL detection (Yu et al. 2005). The present study demonstrates that QTL
mapping for a complex trait, using a small sample size, can be achieved using the high marker
coverage of GBS data.
The method presented here for mapping with a NCII population structure is applicable to
other maize breeding programs as well as any crop utilizing heterosis. The mixed model
approach is flexible and can be used for mapping multiple phenotypic traits simultaneously
(Malosetti et al. 2008) or for a multi-QTL analysis using a Bayesian approach (van Eeuwijk et al.
2010). In addition, QTL by environment effects can be investigated by including data from
weather stations (Boer et al. 2007). While this mixed model included only additive effects,
mixed models can include a dominance term, as demonstrated by Yu et al. (2005), for studies
investigating the mechanism of heterosis or the prediction of hybrid performance. This approach
is applicable to any breeding program using NCII, to increase the efficiency of QTL detection in
their elite germplasm, and is applicable for any phenotypic trait of interest in the hybrid
germplasm.
4.6 CONCLUSIONS
This paper describes a mixed model approach for in silico mapping in maize with NCII
using GBS data. Using in silico mapping with a G matrix to describe relationships between lines,
123 SNPs associated with approximately 12 genomic regions showing additive effects for grain
yield in elite Stiff Stalk germplasm were identified. Collectively, these regions explained
approximately 9.38% of the phenotypic variance observed in grain yield. Several of these
genomic regions are novel, including QTLs in bins 9.02, 10.05 and four bins on chromosome 3.
Mapping using both pedigree-based and GBS-derived relationship matrices demonstrated that
these relationship matrices can be interchanged for similar results. The NCII ANOVA, showing
no significant additive effects in Iodent lines, but showing significant density by additive
variation in Stiff Stalk lines, gives strength to the in silico mapping results. The approach
described here illustrates the integration of QTL mapping with maize breeding program data
using a common mating design and the utility of a GBS relationship matrix.
CHAPTER 5: GENERAL DISCUSSION
5.1 APPLICATIONS OF GENOTYPING BY SEQUENCING DATA IN MAIZE
BREEDING PROGRAMS
This thesis describes several methods of GBS data analysis using data from the
University of Guelph maize breeding program, in conjunction with publically available data.
This research compared the effectiveness of a newly described method, the network diagram, to
the commonly used hierarchical clustering method for analyzing population structure. The
network diagram was shown to be more effective for generating U.S. field corn heterotic patterns
than hierarchical clustering using Ward’s method. While hierarchical clustering will place all
lines in the dendrogram with no reflection of the strength of the connection, the network diagram
allows for lines to be placed outside of or between key clusters and can identify the lines at the
core of each cluster. This research expanded on the work of Romay et al. (2013) by demonstrating
that this method is effective for large datasets (nearly 1,200 lines) and datasets including ex-PVP,
public and University of Guelph germplasm. This study also demonstrated that GBS data has
applications for identifying close relatives of maize lines using hierarchical clustering, and
described several outcomes of this approach. Finally, GBS data was used to determine the extent
of parental contribution as well as to identify parental genome segments in progeny derived from
breeding crosses. The identification of parental genome segments in progeny from inter-heterotic
crosses can be used to investigate the nature of heterotic patterns and heterosis. This method can
also be applied to intra-heterotic crosses used in the breeding program, to determine favourable
linkage blocks that have been selected in many progeny.
With the large amount of genotypic data becoming available, methods to apply this data
to benefit maize breeding programs are lacking. This thesis meets that need by describing
methods that are suitable for any maize population to benefit a breeding program and also serve
as a foundation for further development of methods to analyze GBS data. As computational
methods become available, it is important to evaluate the success of new approaches against the
old, more common methods of analysis. The discovery that the network diagram was superior to
the common method of hierarchical clustering using Ward’s method, in terms of reflecting maize
heterotic patterns, demonstrates the importance of evaluating new methods as they are
developed. The techniques described in this thesis will allow researchers to apply GBS data to
maize breeding programs in novel ways, facilitating a greater understanding of maize germplasm
as well as benefiting breeding programs.
5.2 IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED
CROSSES
In this thesis, a mixed model approach was used for in silico mapping in maize with NCII
using GBS data. With the genomic relationship matrix, this study identified 123 significant
associations for additive genetic effects for impact of planting density on grain yield in elite Stiff
Stalk germplasm lines. Underlying QTLs together explained approximately 9.38% of the
phenotypic variance, adjusted for environmental effects. Several of these genomic regions are
novel, including QTLs in bins 9.02, 10.05 and four bins on chromosome 3. While previous
reports of in silico mapping utilized pedigree records to construct a relationship matrix for each
heterotic group, this study showed that GBS data can be used to create a genomic relationship
matrix that can be interchanged for the relationship matrix with near identical results. This
suggests that the relationship matrix can be made using GBS data alone, which is especially
beneficial for sets of germplasm with unknown or partially missing pedigree records.
While previous studies have conducted in silico association mapping using hybrid data,
this study was the first to investigate QTL mapping using structured crosses. This allowed for
NCII ANOVA results to be cross-referenced with the results of the QTL mapping experiment.
The observed lack of QTLs detected in Iodent lines was consistent with the NCII ANOVA
analysis indicating that there were no significant additive effects for yield in the Iodent
background. The detection of significant associations in the Stiff Stalk germplasm lines was
consistent with the analysis reporting significant additive by density effects for yield. These
results suggest that an NCII ANOVA can be used as a screening tool, identifying germplasm
with insufficient additive genetic variation prior to conducting the QTL mapping experiment,
thus increasing the efficiency of the research.
Since the detected SNP markers are likely to be linked to each other and the underlying
QTL(s), they are more accurately described as detected LD regions. The next step with this
research is to construct the haplotypes of the Stiff Stalk germplasm, to identify the LD regions
containing the detected SNPs. This task requires ancestral pedigree information for the Stiff
Stalk lines, which is lacking in this study, as well as marker data for these ancestral lines. A
limitation of using GBS data is the prevalence of missing data, as missing data in one or both of
the parents at a locus could pose a challenge in the generation of haplotypes.
This study identified SNPs, likely belonging to LD regions, linked to QTLs for additive
genetic effects for impact of planting density on grain yield. Since one of the key traits maize
breeders have selected for in the past 100 years is density tolerance, it is reasonable to assume
that the maize genome contains signatures of selection, or selective sweeps, for density tolerance.
If signatures of selection for density tolerance have been detected in maize, these could be cross-
referenced with the results of this study to validate the genomic regions detected in this study.
The decreasing cost of marker data and increased efficiency of computational analysis
allows for novel methods of basic and applied genetic research to be developed. As an alternative
to traditional QTL mapping approaches, this thesis demonstrates an approach with wide
applications for any breeding program utilizing NCII designs, allowing breeders to increase the
efficiency of their breeding programs by integrating QTL mapping with existing phenotypic data
from structured crosses. This method allows for cost-effective QTL discovery in elite germplasm
with results that are directly applicable to the breeding program.
REFERENCES
Ajmone-Marsan, P., G. Monfredini, A. Brandolini, A.E. Melchinger, G. Garay and M. Motto.
1996. Identification of QTL for grain yield in an elite hybrid of maize: Repeatability of
map position and effects in independent samples derived from the same population.
Maydica 41:49-57.
Ajmone-Marsan, P., G. Monfredini, W.F. Ludwig, A.E. Melchinger, P. Franceschini, G.
Pagnotto, and M. Motto. 1995. In an elite cross of maize a major quantitative trait locus
controls one-fourth of the genetic variation for grain yield. Theoretical and Applied Genetics
90:415-424.
Andorf, C.M., C.J. Lawrence, L.C. Harper, M.L. Schaeffer, D.A. Campbell, and T.Z. Sen. 2010.
The Locus Lookup tool at MaizeGDB: identification of genomic regions in maize by
integrating sequence information with physical and genetic maps. Bioinformatics 26:
434-436.
Austin, D.F., and M. Lee. 1996. Comparative mapping in F-2:3 and F-6:7 generations of
quantitative trait loci for grain yield and yield components in maize. Theoretical and Applied
Genetics 92:817-826.
Badu-Apraku, B. 2010. Effects of recurrent selection for grain yield and Striga resistance in an
extra-early maize population. Crop Science 50:1735–1743.
Barata, C., and M.J. Carena. 2006. Classification of North Dakota maize inbred lines into
heterotic groups based on molecular and testcross data. Euphytica 151:339–349.
Bastian, M., S. Heymann, and M. Jacomy. 2009. Gephi: an open source software for
exploring and manipulating networks. International AAAI Conference on Weblogs and
Social Media.
Beavis, W.D., O.S. Smith, D. Grant, and R. Fincher. 1994. Identification of quantitative trait loci
using a small sample of topcrossed and F4 progeny from maize. Crop Science 34:882-
896.
Bernardo, R. 2001. Breeding potential of intra- and interheterotic group crosses in maize. Crop
Science 41:68–71.
Bernardo, R., J. Romero-Severson, J. Ziegle, J. Hauser, L. Joe, G. Hookstra, and R.W. Doerge.
2000. Parental contribution and coefficient of coancestry among maize inbreds: pedigree,
RFLP, and SSR data. Theoretical and Applied Genetics 100:552–556.
Bernardo, R., and J. Yu. 2007. Prospects for genomewide selection for quantitative traits in
maize. Crop Science 47:1082–1090.
Bink, M.C.A.M., L.R. Totir, C.J.F. ter Braak, C.R. Winkler, M.P. Boer, and O.S. Smith. 2012.
QTL linkage analysis of connected populations using ancestral marker and pedigree
information. Theoretical and Applied Genetics 124: 1097–1113.
Birchler, J.A., D.L. Auger, and N.C. Riddle. 2003. In search of the molecular basis of heterosis.
The Plant Cell 15:2236-2239.
Boer, M.P., D. Wright, L.Z. Feng, D.W. Podlich, L. Luo, M. Cooper, and F.A. van Eeuwijk.
2007. A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial
data using environmental covariables for QTL-by-environment interactions, with an
example in maize. Genetics 177:1801–1813.
Bradbury, P.J., Z. Zhang, D.E. Kroon, T.M. Casstevens, Y. Ramdoss, and E.S. Buckler. 2007.
TASSEL: software for association mapping of complex traits in diverse samples.
Bioinformatics 23:2633-2635.
Bradley, J.P., K.H. Knittle, and A.F. Troyer. 1988. Statistical methods in seed corn production
selection. Journal of Production Agriculture 1: 34-38.
Bu, S.H., X. Zhao, C. Yi, J. Wen, J. Tu, and Y.M. Zhang. 2015. Interacted QTL mapping in
partial NCII design provides evidences for breeding by design. PLoS ONE 10:
e0121034.
Burt, A.J., C.M. Grainger, M.P. Smid, B.J. Shelp, and E.A. Lee. 2011. Allele mining in exotic
maize germplasm to enhance macular carotenoids. Crop Science 51:991-1004.
Cadzow, M., J. Boocock, H.T. Nguyen, P. Wilcox, T.R. Merriman, and M.A. Black. 2014. A
bioinformatics workflow for detecting signatures of selection in genomic data. Frontiers
in Genetics 5:293-300.
Civardi, L., Y. Xia, K.J. Edwards, P.S. Schnable, and B.J. Nikolau. 1994. The relationship
between genetic and physical distances in the cloned a1-sh2 interval of the Zea mays L.
genome. PNAS 91:8268-8272.
Comstock, R.E., and H.F. Robinson. 1952. Estimation of average dominance of genes. p. 494-
516. In J.W. Gowen (ed.) Heterosis. Iowa State College Press, Ames.
Crepieux, S., C. Lebreton, B. Servin, and G. Charmet. 2004. Quantitative trait loci (QTL)
detection in multicross inbred designs: recovering QTL identical-by-descent status
information from marker data. Genetics 168: 1737–1749.
Crossa, J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño, J.L. Araus, D. Makumbi, R. P.
Singh, S. Dreisigacker, J. Yan, V. Arief, M. Banziger, and H.-J. Braun. 2010. Prediction
of genetic values of quantitative traits in plant breeding using pedigree and molecular
markers. Genetics 186: 713–724.
Crossa, J., Y. Beyene, S. Kassa, P. Pérez, J.M. Hickey, C. Chen, G. de los Campos, J. Burgueño,
V.S. Windhausen, E. Buckler, J.-L. Jannink, M.A. Lopez Cruz, and R. Babu. 2013.
Genomic prediction in maize breeding populations with Genotyping-by-Sequencing. G3
(Bethesda) 3:1903-1926.
Crow, J.F. 1998. 90 years ago: The beginning of hybrid maize. Genetics 148: 923–928.
Danecek, P., A. Auton, G. Abecasis, C.A. Albers, E. Banks, M.A. DePristo, R. Handsaker, G.
Lunter, G. Marth, S.T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project
Analysis Group. 2011. The Variant Call Format and VCFtools. Bioinformatics 27: 2156-
2158.
Darrah, L.L., and M.S. Zuber. 1986. 1985 United States farm maize germplasm base and
commercial breeding strategies. Crop Science 26:1109–1113.
de Koning, D.J., R. Pong-Wong, L. Varona, G.J. Evans, E. Giuffra, A. Sanchez, G. Plastow, J.L.
Noguera, L. Andersson, and C.S. Haley. 2003. Full pedigree quantitative trait locus analysis
in commercial pigs using variance components. Journal of Animal Science 81: 2155–
2163.
Duvick, D.N. 2001. Biotechnology in the 1930s: The development of hybrid maize. Nature
Reviews Genetics 2: 69–74.
Duvick, D.N. 2005. The contribution of breeding to yield advances in maize (Zea mays L.).
Advances in Agronomy 86: 83–145.
East, E.M. 1909. Inbreeding in corn, 1907. In: Connecticut Agric Exp Stn Rep, pp 419–428
Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S. Buckler, and S.E. Mitchell.
2011. A robust, simple Genotyping-by-Sequencing (GBS) approach for high diversity
species. PLoS One 6:e19379.
Emik, L.O., and C.E. Terrill. 1949. Systematic procedures for calculating inbreeding
coefficients. Journal of Heredity 40: 51-55.
FAO. Food and Agriculture Organization of the United Nations statistics division.
Production/Crops. Latest update: 2015. Accessed January 2016. url:
http://faostat3.fao.org/home/E
George, A.W., P.M. Visscher, and C.S. Haley. 2000. Mapping quantitative trait loci in complex
pedigrees: a two-step variance component approach. Genetics 156: 2081–2092.
Gilmour, A.R., B.J. Gogel, B.R. Cullis, and R. Thompson. 2009. ASReml User Guide Release
3.0. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK www.vsni.co.uk
Grupe, A., S. Germer, J. Usuka, D. Aud, J. K. Belknap, R.F. Klein, M.K. Ahluwalia, R. Higuchi,
and G. Peltz. 2001. In silico mapping of complex disease-related traits in mice. Science
292: 1915–1918.
Guo, J., Z. Chen, Z. Liu, B. Wang, W. Song, W. Li, J. Chen, J. Dai, and J. Lai. 2011.
Identification of genetic factors affecting plant density response through QTL mapping of
yield component traits in maize (Zea mays L.). Euphytica 182: 409-422.
Hallauer, A., J. Miranda Filho, and M. Carena. 2010a. Quantitative genetics in maize breeding.
3rd ed. Iowa State Univ. Press, Ames, IA.
Hallauer, A.R., M.J. Carena, and J.M. Filho. 2010b. Hereditary variance: Experimental
estimates. In: Quantitative Genetics in Maize Breeding. Springer, New York, pp. 169-
222.
Hansey, C.N., J.M. Johnson, R.S. Sekhon, S.M. Kaeppler, and N. de Leon. 2011. Genetic
diversity of a maize association population with restricted phenology. Crop Science
51:704–715.
Hansey, C.N., B. Vaillancourt, R.S. Sekhon, N. de Leon, S.M. Kaeppler, and C.R. Buell. 2012.
Maize (Zea mays L.) genome diversity as revealed by RNA-sequencing. PLoS ONE
7:e33071.
Hayes, B.J., P.M. Visscher, and M.E. Goddard. 2009. Increased accuracy of artificial selection by
using the realized relationship matrix. Genetics Research 91: 47–60.
Henderson, C.R. 1976. A simple method for computing the inverse of a numerator relationship
matrix used in prediction of breeding values. Biometrics 32: 69-83.
Ho, C., R. McCouch, and E. Smith. 2002. Improvement of hybrid yield by advanced backcross
QTL analysis in elite maize. Theoretical and Applied Genetics 105:440-448.
Holtrop, A.T. 2016. Genetic architecture for yield potential, density tolerance, and yield stability in
maize (Zea mays L.). MSc thesis, University of Guelph.
Jones, D.F. 1918. The effects of inbreeding and crossbreeding on development. In: Conn Agric
Exp Stn Bull 207, pp 5–100.
Kozumplik, V., I. Pejic, L. Senior, R. Pavlina, G. Graham, and C.W. Stuber. 1996. Use of
molecular markers for QTL detection in segregating maize populations derived from
exotic germplasm. Maydica 41: 211-217.
Lee, E.A., and L.W. Kannenberg. 2004. Effect of inbreeding method and selection criteria on
inbred and hybrid performance. Maydica 49:191-197.
Lee, E.A. and M. Tollenaar. 2007. Physiological basis of successful breeding strategies for
maize grain yield. Crop Science 47:S202-S215.
Lee, E.A., M.J. Ash, and B. Good. 2007. Re-examining the relationship between degree of
relatedness, genetic effects and heterosis in maize. Crop Science 47:629-635.
Lee, E.A., A. Singh, M.J. Ash, and B. Good. 2006. Use of sister-lines and the performance of
modified single-cross maize hybrids. Crop Science 46:312-320.
Li, C., Y. Li, P.J. Bradbury, X. Wu, Y. Shi, Y. Song, D. Zhang, E. Rodgers-Melnick, E.S.
Buckler, Z. Zhang, Y. Li, and T. Wang. 2015a. Construction of high-quality
recombination maps with low-coverage genomic sequencing for joint linkage analysis in
maize. BMC Biology 13:78-89.
Li, Y.-X., X. Wu, J. Jaqueth, D. Zhang, D. Cui, C. Li, G. Hu, H. Dong, Y.C. Song, Y.-S. Shi, T.
Wang, B. Li, and Y. Li. 2015b. The identification of two head smut resistance-related
QTL in maize by the joint approach of linkage mapping and association analysis. PLoS
One 10: e0145549.
Liu, K., M. Goodman, S. Muse, J.S. Smith, E. Buckler, and J. Doebley. 2003. Genetic
structure and diversity among maize inbred lines as inferred from DNA microsatellites.
Genetics 165: 2117–2128.
Liu, C., Z. Hao, D. Zhang, C. Xie, M. Li, X. Zhang, H. Yong, S. Zhang, J. Weng, and X. Li.
2015. Genetic properties of 240 maize inbred lines and identity-by-descent segments
revealed by high-density SNP markers. Molecular Breeding 35:146.
Malosetti, M., J.M. Ribaut, M. Vargas, J. Crossa, and F.A. van Eeuwijk. 2008. A multi-trait
multi-environment QTL mixed model with an application to drought and nitrogen stress
trials in maize (Zea mays L.). Euphytica 161:241–257.
Melchinger, A.E., H.F. Utz and C.C. Schön. 1998. Quantitative trait locus (QTL) mapping using
different testers and independent population samples in maize reveals low power of QTL
detection and large bias in estimates of QTL effects. Genetics 149:383-403.
Messmer, R., Y. Fracheboud, M. Banziger, M. Vargas, P. Stamp, and J. Ribaut. 2009. Drought
stress and tropical maize: QTL-by-environment interactions and stability of QTLs across
environments for yield components and secondary traits. Theoretical and Applied
Genetics 119:913-930.
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic values
using genome-wide dense marker maps. Genetics 157:1819–1829.
Mezmouk, S. and J. Ross-Ibarra. 2014. The pattern and distribution of deleterious mutations in
maize. G3 (Bethesda) 4: 163–171.
Mikel, M.A. and J.W. Dudley. 2006. Evolution of North American dent corn from public to
proprietary germplasm. Crop Science 46: 1193-1205.
Mikel, M.A. 2008. Genetic diversity and improvement of contemporary proprietary North
American dent corn. Crop Science 48: 1686-1695.
Montgomery, E.G. 1916. The corn crops. Macmillan, New York.
Moose, S.P., and R.H. Mumm. 2008. Molecular plant breeding as the foundation for 21st
century crop improvement. Plant Physiology 147:969–977.
Mumm, R.H., L.J. Hubert, and J.W. Dudley. 1994. A classification of 148 U.S. maize inbreds: II.
Validation of cluster analysis based on RFLPs. Crop Science 34:852-865.
Nelson, P.T., N.D. Coles, J.B. Holland, D.M. Bubeck, S. Smith, and M.M. Goodman. 2008.
Molecular characterization of maize inbreds with expired U.S. Plant Variety Protection.
Crop Science 48:1673-1685.
Nelson, B.K., A.L. Kahler, J.L. Kahler, M.A. Mikel, S.A. Thompson, R.S. Ferriss, S. Smith, and
E.S. Jones. 2011. Evaluation of the numbers of single nucleotide polymorphisms required
to measure genetic distance in maize (Zea mays L.). Crop Science 51: 1470-1480.
Nikolić, A., D. Ignjatović-Micić, D. Dodig, V. Anđelković, and V. Lazić-Jančić. 2012.
Identification of QTLs for yield and drought-related traits in maize: assessment of their
causal relationships. Biotechnology and Biotechnological Equipment 26:2952-2960.
Parisseaux, B. and R. Bernardo. 2004. In silico mapping of quantitative trait loci in maize.
Theoretical and Applied Genetics 109: 508–514.
Piepho, H.P. 2009. Ridge regression and extensions for genomewide selection in maize. Crop
Science 49: 1165–1176.
Pryce, J.E., B.J. Hayes, and M. E. Goddard. 2012. Novel strategies to minimize progeny
inbreeding while maximizing genetic gain using genomic information. Journal of Dairy
Science 95: 377–388.
Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P.
Sklar, P.I.W. De Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: a toolset for whole-
genome association and population-based linkage analysis. American Journal of Human
Genetics 81:559-575.
R Core Team. 2015. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Ribaut, J.-M., C. Jiang, D. Gonzalez-de-Leon, G. O. Edmeades, and D. A. Hoisington. 1997.
Identification of quantitative trait loci under drought conditions in tropical maize. 2.
Yield components and marker-assisted selection strategies. Theoretical and Applied Genetics
94:887-896.
Rojas, B.A. and G.F. Sprague. 1952. A comparison of variance components in corn yield trials:
III. General and specific combining ability and their interaction with locations and years.
Agronomy Journal 44:462-466.
Romay, M.C., R.A. Malvar, L. Campo, A. Álvarez, J. Moreno-González, A. Ordás, and P.
Revilla. 2010. Climatic and genotypic effects for grain yield in maize under stress
conditions. Crop Science 50: 51–58.
Romay, M.C., M.J. Millard, J.C. Glaubitz, J.A. Peiffer, K.L. Swarts, T.M. Casstevens, R.J.
Elshire, C.B. Acharya, S.E. Mitchell, S.A. Flint-Garcia, M.D. McMullen, J.B. Holland,
E.S. Buckler, and C.A. Gardner. 2013. Comprehensive genotyping of the USA national
maize inbred seed bank. Genome Biology 14:R55.
SAS/STAT software, Version 9.4 of the SAS System for Unix. Copyright © 2002-2012 SAS
Institute Inc.
Shull, G.H. 1908. The composition of a field of maize. Amer Breeders’ Assoc Rep 4:296–301
Shull, G.H. 1909. A pureline method of corn breeding. Amer Breeders’ Assoc Rep 5:51–59
Silva, F.F.E., J.M.S. Viana, V.R. Faria, and M.D.V. de Resende. 2013. Bayesian inference of
mixed models in quantitative genetics of crop species. Theoretical and Applied Genetics
126:1749–1761.
Stuber, C.W., S.E. Lincoln, D.W. Wolff, T. Helentjaris, and E.S. Lander. 1992. Identification of
genetic factors contributing to heterosis in a hybrid from two elite maize inbred lines
using molecular markers. Genetics 132:823-839.
Tanksley, S.D. 1993. Mapping polygenes. Annual Review of Genetics 27: 205–233.
Tollenaar, M. and E.A. Lee. 2002. Yield potential, yield stability and stress tolerance in maize.
Field Crops Research 75: 161-169.
Troyer, A.F. 1999. Background of U.S. hybrid corn. Crop Science 39: 601–626.
Troyer, A.F. and M.A. Mikel. 2010. Minnesota corn breeding history: Department of Agronomy
& Plant Genetics Centennial. Crop Science 50: 1141–1150.
van Eeuwijk, F.A., M. Boer, L.R. Totir, M. Bink, D. Wright, C.R. Winkler, D. Podlich, K.
Boldman, A. Baumgarten, M. Smalley, M. Arbelbide, C.J.F. ter Braak, and M. Cooper.
2010. Mixed model approaches for the identification of QTLs within a maize hybrid
breeding program. Theoretical and Applied Genetics 120: 429-440.
VanRaden, P.M. 2008. Efficient methods to compute genomic predictions. Journal of Dairy
Science 91: 4414-4423.
Wolfe, D., S. Dudek, M.D. Ritchie, and S.A. Pendergrass. 2013. Visualizing genomic
information across chromosomes with PhenoGram. BioData Mining 6:18-29.
Wu, Y., F.S. Vicente, K. Huang, T. Dhliwayo, D.E. Costich, K. Semagn, N. Sudha, M. Olsen,
B.M. Prasanna, X. Zhang, and R. Babu. 2016. Molecular characterization of CIMMYT
maize inbred lines with genotyping-by-sequencing SNPs. Theoretical and Applied Genetics
129:753–765.
Yu, J., M. Arbelbide, and R. Bernardo. 2005. Power of in silico QTL mapping from phenotypic,
pedigree, and marker data in a hybrid breeding program. Theoretical and Applied
Genetics 110: 1061–1067.
Zhang, Y.-M., Y. Mao, C. Xie, H. Smith, L. Luo, and S. Xu. 2005. Mapping quantitative trait
loci using naturally occurring genetic variance among commercial inbred lines of maize
(Zea mays L.). Genetics 169: 2267–2275.
Zhang, X., P. Pérez-Rodríguez, K. Semagn, Y. Beyene, R. Babu, M.A. López Cruz, F. San
Vicente, M. Olsen, E. Buckler, J.L. Jannink, B.M. Prasanna, and J. Crossa. 2015.
Genomic prediction in biparental tropical maize populations in water-stressed and well-
watered environments using low-density and GBS SNPs. Heredity 114:291–299.
Zhou, Z., C. Zhang, Y. Zhou, Z. Hao, Z. Wang, X. Zeng, H. Di, M. Li, D. Zhang, H. Yong, S.
Zhang, J. Weng, and X. Li. 2016. Genetic dissection of maize plant architecture with an
ultra-high density bin map based on recombinant inbred lines. BMC Genomics 17: 178-
192.
Zhu, C., M. Gore, E.S. Buckler, and J. Yu. 2008. Status and prospects of association mapping in
plants. The Plant Genome 1:5–20.
SUPPLEMENTAL TABLES AND FIGURES FOR CHAPTER 4: IN SILICO MAPPING
OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES
Supplementary Table 4.1. Position, likelihood ratio test statistic (LRT), and FDR-adjusted q-value
for significant SNPs found in Stiff Stalk germplasm. Gene models containing a SNP are listed.
Chromosome  Position (bp, according to RefGen v2)  LRT  FDR q-value  Gene models from MaizeGDB
1 66,975,722 20.7 0.0012
1 67,077,524 20.7 0.0012
1 67,440,505 20.7 0.0012
1 67,464,758 20.7 0.0012
1 67,480,447 20.7 0.0012
1 67,993,826 20.7 0.0012
1 68,439,791 20.7 0.0012
1 68,439,797 20.7 0.0012
1 68,561,821 20.7 0.0012
1 70,247,320 20.7 0.0012
1 72,022,649 20.7 0.0012
1 276,818,314 20.7 0.0012 GRMZM2G421491/ZEAMMB73_922706 (gsht1 - glutathione transporter1)
3 6,976,782 20.7 0.0012
3 18,184,709 24.12 0.0012
3 18,999,799 20.7 0.0012
3 20,400,067 20.7 0.0012
3 20,417,559 20.7 0.0012
3 20,417,562 20.7 0.0012
3 21,400,755 20.7 0.0012
3 22,250,501 20.7 0.0012
3 22,570,789 20.7 0.0012
3 22,944,442 20.7 0.0012
3 23,009,927 20.7 0.0012
3 23,009,945 20.7 0.0012
3 23,009,951 20.7 0.0012
3 23,012,358 20.7 0.0012
3 26,308,887 20.7 0.0012
3 27,552,196 20.7 0.0012 GRMZM2G026643/ZEAMMB73_817367 (ocl1 - outer cell layer1)
3 27,552,199 20.7 0.0012
3 141,359,601 20.7 0.0012 GRMZM2G101020/ZEAMMB73_429199 (atp1 - ATPase1)
3 141,444,450 20.7 0.0012 GRMZM2G067171/ZEAMMB73_582636 (gata31 - C2C2-GATA-transcription factor 31)
3 142,253,006 20.7 0.0012
3 142,261,549 20.7 0.0012
3 147,292,600 20.7 0.0012
3 147,292,623 20.7 0.0012
3 147,292,637 20.7 0.0012
3 148,291,030 20.7 0.0012
3 148,291,283 20.7 0.0012
3 153,303,132 20.7 0.0012
3 153,764,074 20.7 0.0012
3 154,124,779 20.7 0.0012
3 154,373,563 20.7 0.0012
3 154,373,583 20.7 0.0012
3 154,373,594 20.7 0.0012
3 156,275,421 20.7 0.0012
3 157,085,834 20.7 0.0012
3 211,719,240 20.7 0.0012
5 18,276,882 20.7 0.0012
5 18,276,955 20.7 0.0012
5 18,276,959 20.7 0.0012
5 18,276,960 20.7 0.0012
9 18,702,410 20.7 0.0012
9 18,765,811 20.7 0.0012
10 10,138,359 20.7 0.0012 GRMZM2G068710/ZEAMMB73_417749 (c2h35 - C2H2-transcription factor 235)
10 10,138,363 20.7 0.0012
10 10,170,505 20.7 0.0012
10 10,170,516 20.7 0.0012
10 10,196,528 13.62 0.0478
10 10,196,543 20.7 0.0012
10 10,201,182 20.7 0.0012
10 10,201,183 20.7 0.0012
10 10,201,193 20.7 0.0012
10 10,245,870 13.62 0.0478
10 10,603,533 20.7 0.0012
10 10,615,471 20.7 0.0012
10 10,620,573 20.7 0.0012
10 10,667,000 20.7 0.0012
10 10,667,383 20.7 0.0012
10 10,829,394 20.7 0.0012
10 10,829,400 20.7 0.0012
10 14,001,763 20.7 0.0012
10 14,445,038 20.7 0.0012 GRMZM2G083347/ZEAMMB73_391358 (nactf67 - NAC-transcription factor 67)
10 14,446,091 20.7 0.0012
10 14,446,149 20.7 0.0012
10 14,764,622 20.7 0.0012
10 14,908,455 24.12 0.0012
10 15,208,871 20.7 0.0012
10 15,208,872 20.7 0.0012
10 15,237,302 20.7 0.0012
10 15,393,034 20.7 0.0012
10 15,549,495 24.12 0.0012
10 16,031,237 20.7 0.0012
10 16,031,241 20.7 0.0012
10 16,213,666 20.7 0.0012
10 16,354,706 20.7 0.0012
10 16,354,729 20.7 0.0012
10 16,372,589 20.7 0.0012
10 16,372,595 20.7 0.0012
10 79,940,553 20.7 0.0012
10 79,987,079 20.7 0.0012
10 80,042,615 20.7 0.0012
10 80,042,635 20.7 0.0012
10 80,048,193 20.7 0.0012
10 80,492,532 20.7 0.0012
10 80,576,147 20.7 0.0012
10 83,456,644 20.7 0.0012
10 83,456,689 20.7 0.0012
10 85,394,188 20.7 0.0012 GRMZM2G001824/ZEAMMB73_891377 (myb134 - MYB-transcription factor 134)
10 86,822,065 20.7 0.0012
10 87,419,786 20.7 0.0012 GRMZM2G036554/ZEAMMB73_653283 (bhlh153 - bHLH-transcription factor 153)
10 91,960,025 20.7 0.0012
10 91,960,027 20.7 0.0012
10 91,960,032 20.7 0.0012
10 92,033,761 20.7 0.0012
10 93,140,324 20.7 0.0012
10 93,140,350 20.7 0.0012
10 93,140,365 20.7 0.0012
10 95,232,647 20.7 0.0012
10 95,232,657 20.7 0.0012
10 97,372,975 20.7 0.0012
10 104,831,710 20.7 0.0012
10 105,281,679 20.7 0.0012
10 110,205,731 20.7 0.0012
10 110,205,733 20.7 0.0012
10 110,298,066 20.7 0.0012
10 110,298,193 20.7 0.0012
10 110,298,195 20.7 0.0012
10 111,426,216 20.7 0.0012
10 132,245,294 20.7 0.0012
10 132,533,341 20.7 0.0012
10 132,533,404 20.7 0.0012
10 132,533,405 20.7 0.0012
10 132,533,406 20.7 0.0012