Analysis of Genotyping-by-Sequencing data in a maize (Zea mays L.) breeding program by Alison R. Cooke A Thesis presented to The University of Guelph In partial fulfilment of requirements for the degree of Master of Science in Bioinformatics © Alison R. Cooke, December, 2016


Analysis of Genotyping-by-Sequencing data in a maize (Zea mays L.)

breeding program

by

Alison R. Cooke

A Thesis

presented to

The University of Guelph

In partial fulfilment of requirements

for the degree of

Master of Science

in

Bioinformatics

© Alison R. Cooke, December, 2016


ABSTRACT

ANALYSIS OF GENOTYPING-BY-SEQUENCING DATA IN A MAIZE (ZEA MAYS L.)

BREEDING PROGRAM

Alison R. Cooke, University of Guelph, 2016

Advisors: Dr. Andy Robinson, Dr. Elizabeth A. Lee

As genotyping-by-sequencing data becomes more abundant, new applications of this technology in breeding programs are possible. This research utilized publicly available data in conjunction with data from the University of Guelph maize breeding program. Using a genetic similarity matrix, a network diagram and a dendrogram were generated, reflecting the maize heterotic patterns. Hierarchical clustering was used to identify putative parents of University of Guelph lines from a publicly available dataset. For lines derived from inter-heterotic breeding crosses, parental linkage blocks were identified and visualized across the chromosomes. The marker data was also used in conjunction with grain yield data for association mapping using a mixed model approach. This method identified 123 quantitative trait loci for additive effects for grain yield in Stiff Stalk lines, accounting for approximately 9.38% of the phenotypic variance. Breeders can increase the utility of their marker data in their breeding programs by applying the methods described here.


ACKNOWLEDGEMENTS

I would first like to thank my advisors, Dr. Liz Lee and Dr. Andy Robinson, for creating

this unique thesis project for me, combining the data analysis techniques of animal science with

phenotypic data from Liz’s maize breeding program. I am grateful for their support and guidance

with my research. They were both wonderful advisors.

Thank you to my advisory committee for providing insight from the statistical and plant

breeding perspectives. Special thanks to Dr. Gord Vandervoort, who was invaluable in assisting

me with my data analysis. He is extremely knowledgeable and always willing to offer advice

with a smile on his face.

I would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Highly Qualified Personnel (HQP) scholarship programs for my Master's scholarships. Finally, thank you to Pioneer Hi-Bred for granting me an internship as part of my HQP scholarship – I learned a lot and had a great summer!


TABLE OF CONTENTS

Acknowledgements
List of Figures
List of Tables
List of Abbreviations Used
Chapter 1: Introduction
Chapter 2: Literature Review and Research Objectives
  2.1 Heterotic patterns in maize breeding
    2.1.1 Development of hybrids from open pollinated populations
    2.1.2 Privatization of U.S. maize breeding
    2.1.3 The role of heterotic patterns in maize breeding
    2.1.4 U.S. heterotic patterns and founding inbred lines
  2.2 Use of genetic markers to investigate relationships of maize inbreds
    2.2.1 SNP data for genetic analysis of maize relationships
    2.2.2 Hierarchical clustering with maize inbreds
    2.2.3 Network diagram to group related individuals
    2.2.4 Identifying parental genomic blocks in maize populations
  2.3 Grain yield improvements in maize
    2.3.1 Yield improvements in the 20th century
    2.3.2 Role of planting density in yield increases
    2.3.4 Traditional QTL mapping for grain yield in maize
  2.4 In silico association mapping with maize breeding program data
    2.4.1 Advantages of in silico association mapping
    2.4.2 In silico QTL mapping with maize breeding program data
    2.4.3 Mixed models for in silico mapping
  2.5 QTL mapping with North Carolina Design II data
    2.5.1 Description of North Carolina Design II
    2.5.2 QTL mapping with NCII breeding program data
  2.6 Research objectives and hypotheses
Chapter 3: Applications of Genotyping-by-Sequencing data in Maize Breeding Programs
  3.1 Abstract
  3.2 Introduction
  3.3 Methods
    3.3.1 Germplasm and marker data
    3.3.2 Network Diagram
    3.3.3 Hierarchical clustering
    3.3.4 Identity by descent
  3.4 Results and Discussion
    3.4.1 Classifying Lines into heterotic patterns using GBS data
    3.4.2 Identification of putative parents of lines of unknown parentage using GBS data
    3.4.3 Determining genome contribution of the parents from a breeding cross using GBS data
  3.5 Conclusions
CHAPTER 4: IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES
  4.1 Abstract
  4.2 Introduction
  4.3 Methods
    4.3.1 Germplasm and Field Trials
    4.3.2 Molecular marker data
    4.3.3 Mixed Model Analysis
  4.4 Results and Discussion
    4.4.1 Genomic relationship matrices
    4.4.2 QTL detection using genomic and pedigree-based matrices
    4.4.3 Informative SNPs used for in silico mapping
    4.4.4 QTL detection and NCII
    4.4.5 QTLs detected in Stiff Stalk
    4.4.6 Discussion of the mixed model approach
  4.6 Conclusions
CHAPTER 5: GENERAL DISCUSSION
  5.1 Applications of genotyping by sequencing data in maize breeding programs
  5.2 In silico mapping of maize grain yield QTLs using structured crosses
REFERENCES
SUPPLEMENTAL TABLES AND FIGURES FOR CHAPTER 4: IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES


LIST OF FIGURES

Figure 3.1. Network diagram for 1,081 public and ex-PVP lines and 110 Guelph lines generated using IBS values greater than 0.3 (A). Points sharing the same colour (red, green, pink or dark blue) indicate Guelph inbred lines derived from the same pedigree. Other Guelph lines are coloured light blue. Founder inbred lines of heterotic patterns are labelled. The germplasm clusters into three main heterotic pools: Stiff Stalk (B73, B37, B14), Iodent (PH207) and Lancaster (Mo17, Oh43, Wf9). Network diagram with labels, zoomed in to the region containing B73 (circled), showing the level of detail generated in the diagram (B).

Figure 3.2. Visual representation of the main division of branches in the dendrogram. The branches containing tropical, popcorn, and sweet corn are labelled. The remaining branches of the dendrogram contain U.S. field corn and Guelph lines, with the locations of key founder individuals indicated.

Figure 3.3. Subset of the dendrogram showing closest neighbours of PHJ40 (A) and PHR25 (B), located on branches that diverge in the third split of the dendrogram, when University of Guelph lines are not included in the dataset. When progeny from PHJ40 x PHR25 are included in the dataset, these lines cluster together, with their progeny (C).

Figure 3.4. Histogram of the proportion of SNP alleles unaccounted for by either parent when inbred lines and parents are randomly sampled from the dataset (n=10,000).

Figure 3.5. Clustering of five siblings places all five lines with PHJ40 as the putative parent (A). Removing SNPs with identical alleles between the lines and PHJ40 places each line with PHG72, showing CG64 as an example (B).

Figure 3.6. Clustering of five siblings places four of the siblings together with LH85 as the putative parent (A). For these four siblings, removing SNPs with identical alleles to LH85 places each line with PHJ40, showing CG70 as an example (B). The fifth sibling, CG42, clusters elsewhere (Figure 3.7).

Figure 3.7. Initial clustering places CG42 with Tx714 and N201 (A). Removing SNPs with identical alleles to N201 (B) or Tx714 (C) places CG42 with NC278 as the second putative parent.

Figure 3.8. Clustering of five siblings did not reveal a clear putative parent. When each sibling was clustered individually, two of the siblings, CG57 (A) and CG65 (B), clustered with SD102, with differing surrounding ex-PVP lines. The other three siblings, CG37 (C), CG38 (D) and CG79 (E), clustered with the same set of ex-PVP lines, which differs from the clustering in (A) and (B).

Figure 3.9. Parental linkage block distribution across the seven inbreds derived from the two-way breeding cross, PHJ40 (Stiff Stalk, blue) and PHR25 (Iodent, red). Of the SNP alleles that could be assigned to a parent, the percentage of SNP alleles inherited from the Iodent parent is shown.

Figure 3.10. Parental linkage block distribution across the three inbreds derived from the three-way breeding cross. Linkage blocks in blue were inherited from CG102, red from CG33, and green from CG65.

Figure 3.11. The expected and observed percentage of parental contribution in the 3-way cross derived inbred lines based on SNP alleles that could be assigned to a parent. Parental lines: CG102 (blue), CG65 (green) and CG33 (red).

Figure 3.12. Histogram of sizes of parental linkage blocks, with the frequency of blocks in each bin expressed as a percentage of the total number of linkage blocks for the inbred line.

Figure 3.13. Ideograms of SNP alleles shared by all seven siblings from the Stiff Stalk x Iodent cross (A) and all three siblings in the 3-way cross (B). Regions in blue represent PHJ40 and red represent PHR25 (A); regions in blue represent CG102, red CG33, and green CG65 (B). Regions of dense SNP markers are indicated by green boxes (A).

Figure 4.1. Scatter plot of Likelihood Ratio Test statistic (LRT) values for Stiff Stalk markers for additive effects for grain yield using the G matrix (top) and pedigree matrix (bottom). Lines indicate FDR adjusted q-value < 0.05.

Figure 4.2. Ideogram illustrating the genome coverage of the SNP markers used for in silico mapping. Black bands show the position of (A) 27,398 SNPs in Stiff Stalk inbred lines and (B) 32,401 SNPs in Iodent inbred lines.

Figure 4.3. Ideogram showing the chromosomal locations of the SNPs that are significantly associated with regions influencing grain yield in the Stiff Stalk inbred lines.


LIST OF TABLES

Table 3.1. Pedigrees and families of Guelph inbred lines and ex-PVP lines used to generate GBS data. Pedigree information from E. Lee (unpublished data) and U.S. plant variety protection certificates.

Table 3.2. The percentage of SNPs unaccounted for by either parent for inbred lines of known parentage.

Table 3.3. Number and size of parental linkage blocks identified in 10 University of Guelph inbred lines derived from breeding crosses.

Table 3.4. IBD regions between the seven siblings of the Stiff Stalk x Iodent cross. These regions of dense marker coverage were selected based on visual inspection of the ideogram (Figure 3.13).

Table 4.1. Pedigree of Stiff Stalk and Iodent inbred lines used in the NCII. Pedigree information from E.A. Lee (unpublished data) and United States plant variety protection certificates.

Table 4.2. GBS-based relationship matrix for the Stiff Stalk inbred lines.

Table 4.3. GBS-based relationship matrix for the Iodent inbred lines.

Table 4.4. Pedigree-based relationship matrix for the Stiff Stalk inbred lines.

Table 4.5. Pedigree-based relationship matrix for the Iodent inbred lines.

Table 4.6. The levels of LRT values observed in the scatter plot (Figure 4.1) correspond to different patterns of alleles across the inbred lines. Level 1 indicates the highest LRT value. The numbers for the alleles indicate the count of minor alleles (i.e. 0 = homozygous major allele, 1 = heterozygous, 2 = homozygous minor allele).

Table 4.7. Mapping with the pedigree-based matrix results in more markers with significant associations than mapping with the genomic relationship matrix. Likelihood ratio test (LRT) statistics and corresponding adjusted p-values (q-values) for SNPs in the different LRT levels are shown. Q-values < 0.05 are indicated with a *.

Table 4.8. For markers in the top three LRT levels, homozygosity for the major allele is associated with high grain yield at commercial (74 k ha⁻¹) and high (148 k ha⁻¹) population densities. The inbred lines are ordered according to the number of minor alleles at loci in the top three LRT levels (see Table 4.6). Estimates of grain yield, expressed as g_i, are shown for each inbred line for each planting density (Holtrop 2016). These estimates reflect the difference between the average yield of all progeny of a parental line and the average yield of all hybrids grown in an environment.


LIST OF ABBREVIATIONS USED

ANOVA: analysis of variance
Ex-PVP: expired Plant Variety Protection
GBS: genotyping-by-sequencing
GS: genomic selection
IBD: identical-by-descent / identity-by-descent
IBS: identical-by-state / identity-by-state
LD: linkage disequilibrium
LogL: log likelihood
MAS: marker assisted selection
NCII: North Carolina Design II
PCA: principal component analysis
PCOA: principal coordinate analysis
PVP: Plant Variety Protection
QTL: quantitative trait locus
REML: restricted maximum likelihood
RFLP: restriction fragment length polymorphisms
RIL: recombinant inbred line
SNP: single nucleotide polymorphism
SSR: simple sequence repeats
UPGMA: unweighted pair group method with arithmetic mean


CHAPTER 1: INTRODUCTION

Maize (Zea mays L.) is a globally important crop used for food, animal feed, ethanol production and many industrial purposes. With over 1 billion tonnes produced in 2013, maize has the second highest global production of any crop, following sugar cane (FAO 2015). Genotyping-by-sequencing (GBS), a recently developed method, assesses a large number of SNP markers at a low cost per sample (Elshire et al. 2011). This method generates large amounts of data per sample, and data for thousands of maize lines is publicly available. The clustering of maize lines into heterotic groups is a common technique used to understand the germplasm structure within maize breeding programs. Recently, a new method of clustering, called a network diagram, has been described (Romay et al. 2013). A comparison of this new approach with the more common method, hierarchical clustering, is needed to determine the merits of each. Additionally, novel applications of GBS data to maize breeding programs need to be developed to increase the potential use of this data.

This thesis demonstrates methods of using publicly available GBS data in conjunction with data from the University of Guelph maize breeding program. It compares the outcomes of clustering commercial maize germplasm with the network diagram and hierarchical clustering approaches, using GBS data for nearly 1,200 maize lines. This comparison allows researchers to better understand the differences between the approaches and to select the best method for their experiment. The thesis then describes novel applications of GBS data, including identifying close relatives of maize lines from a dataset and visualizing the inheritance of parental linkage blocks along the chromosomes. Together, the comparison of current methods and the novel applications of GBS data facilitate a greater understanding of maize germplasm and benefit breeding programs.
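The network diagram idea can be illustrated with a minimal sketch: compute a pairwise identity-by-state (IBS) similarity between lines and keep only the pairs above a cutoff as edges. Everything in the block is an assumption made for illustration, including the 0/1/2 minor-allele coding, the per-SNP IBS score, the line names and the toy cutoff of 0.7. It is not the procedure used in this thesis or in Romay et al. (2013), and a cutoff chosen for real GBS data would depend on how similarity is defined there.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Mean identity-by-state over SNPs called in both lines.

    Genotypes are minor-allele counts (0, 1 or 2); None marks a missing call.
    Per-SNP IBS is taken as 1 - |a - b| / 2 (an assumed scoring), so identical
    calls score 1 and opposite homozygotes score 0.
    """
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return float("nan")  # no SNP called in both lines
    return sum(1 - abs(a - b) / 2 for a, b in shared) / len(shared)


def network_edges(genotypes, threshold):
    """Return (line_a, line_b, similarity) for every pair above the threshold."""
    edges = []
    for a, b in combinations(sorted(genotypes), 2):
        s = ibs_similarity(genotypes[a], genotypes[b])
        if s > threshold:  # NaN compares False, so uncomparable pairs are dropped
            edges.append((a, b, s))
    return edges


# Hypothetical toy data: two lines from each of two invented pools.
toy = {
    "SS_A": [0, 0, 2, 2, 0, 2],
    "SS_B": [0, 0, 2, 2, 0, None],
    "IO_A": [2, 2, 0, 0, 2, 0],
    "IO_B": [2, 2, 0, 1, 2, 0],
}
edges = network_edges(toy, threshold=0.7)
for a, b, s in edges:
    print(f"{a} -- {b}: IBS = {s:.3f}")
```

In this toy case only the within-pool pairs survive the cutoff, which is the behaviour a network diagram relies on: closely related lines stay connected while distant lines are left unlinked.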


The high marker density of GBS data also has applications in quantitative trait loci (QTL) mapping. Maize breeders are continually seeking new QTLs for desirable agronomic traits, as marker data is now an essential component of maize breeding programs (Bernardo and Yu 2007). There is now an opportunity to develop new methods that use the high marker density of GBS data to increase the efficiency of QTL discovery and the application of results to breeding programs. Traditional QTL mapping is performed using linkage mapping, with a segregating mapping population generated from a bi-parental cross, but the results are not always applicable to elite germplasm. An alternative method is association mapping, which uses historical linkage disequilibrium to link genomic regions to phenotypes (Zu et al. 2008). Association mapping can be performed using existing phenotypic data from elite germplasm to detect QTLs through computational methods, referred to as in silico mapping (Grupe et al. 2001). Using elite germplasm to detect QTLs bridges the gap between research and application, allowing results to be applied directly using marker assisted selection (Parisseaux and Bernardo 2004; Crepieux et al. 2005). Additionally, using existing data from breeding programs costs less than conducting a traditional QTL mapping experiment (Parisseaux and Bernardo 2004).

Previous studies have detected QTLs with in silico association mapping using phenotypic data from hybrids in a breeding program (Parisseaux and Bernardo 2004; van Eeuwijk et al. 2010). However, these studies generated hybrids by crossing individuals from different heterotic pools in various combinations, rather than systematically generating hybrids. This thesis examines the potential of utilizing a structured mating design, North Carolina Design II (NCII) (Comstock and Robinson 1952), and GBS data for in silico association mapping. The NCII is a commonly used mating design, in which each member of a group of females is crossed with each member of another group of males to identify superior inbred lines and parental combinations. The use of a systematic design facilitates the partitioning of the genotypic variance into effects due to the female, effects due to the male, and effects due to the interaction of the male and female using a two-way analysis of variance (ANOVA) (Hallauer et al. 2010a). This thesis explores the use of NCII ANOVA as a preliminary step to in silico QTL mapping, ensuring that sufficient additive genetic variation is present in the germplasm for the phenotype of interest. This research describes a method of in silico association mapping using GBS data and existing hybrid data for grain yield from an NCII consisting of elite short-season Stiff Stalk and Iodent inbred lines. This research also investigates the potential of using GBS data to generate a genomic relationship matrix to substitute for the traditionally used pedigree-based additive relationship matrix. This method of in silico QTL mapping using GBS data allows breeders to increase the efficiency of their breeding programs by integrating QTL mapping with existing phenotypic hybrid data from structured crosses, leading to directly applicable marker associations to use in selection decisions.
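The variance partition described for the NCII can be sketched as a balanced two-way ANOVA with sums of squares for females, males, their interaction and residual error. The block is a minimal illustration under stated assumptions: a complete factorial with equal replication, fixed-effect sums of squares only, and invented parent labels and yield values. It is not the analysis carried out in this thesis, which uses a mixed model.

```python
def ncii_anova(yields):
    """Balanced two-way ANOVA for a female x male factorial (NCII layout).

    yields maps (female, male) -> list of replicate observations.
    Returns sums of squares, degrees of freedom and mean squares per source.
    """
    females = sorted({f for f, _ in yields})
    males = sorted({m for _, m in yields})
    reps = len(next(iter(yields.values())))
    for f in females:
        for m in males:
            if (f, m) not in yields or len(yields[(f, m)]) != reps:
                raise ValueError("a complete, equally replicated factorial is required")

    def mean(xs):
        return sum(xs) / len(xs)

    nf, nm = len(females), len(males)
    grand = mean([y for ys in yields.values() for y in ys])
    cell = {k: mean(v) for k, v in yields.items()}
    f_mean = {f: mean([cell[(f, m)] for m in males]) for f in females}
    m_mean = {m: mean([cell[(f, m)] for f in females]) for m in males}

    ss = {
        "female": reps * nm * sum((f_mean[f] - grand) ** 2 for f in females),
        "male": reps * nf * sum((m_mean[m] - grand) ** 2 for m in males),
        "female_x_male": reps * sum(
            (cell[(f, m)] - f_mean[f] - m_mean[m] + grand) ** 2
            for f in females for m in males
        ),
        "error": sum((y - cell[k]) ** 2 for k, ys in yields.items() for y in ys),
    }
    df = {
        "female": nf - 1,
        "male": nm - 1,
        "female_x_male": (nf - 1) * (nm - 1),
        "error": nf * nm * (reps - 1),
    }
    return {
        src: {"ss": ss[src], "df": df[src], "ms": ss[src] / df[src] if df[src] else float("nan")}
        for src in ss
    }


# Hypothetical data: 2 females x 2 males, 2 replicates each (values invented).
toy_yields = {
    ("F1", "M1"): [10.0, 12.0],
    ("F1", "M2"): [8.0, 10.0],
    ("F2", "M1"): [6.0, 8.0],
    ("F2", "M2"): [4.0, 6.0],
}
table = ncii_anova(toy_yields)
for source, row in table.items():
    print(source, row)
```

In this toy table all of the between-hybrid variation sits in the female and male main effects and none in the interaction. In an NCII, the main effects are the terms conventionally read as reflecting additive variation, which is the kind of preliminary check the paragraph above describes.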


CHAPTER 2: LITERATURE REVIEW AND RESEARCH OBJECTIVES

2.1 HETEROTIC PATTERNS IN MAIZE BREEDING

2.1.1 Development of hybrids from open pollinated populations

Maize was domesticated in tropical southern Mexico 7,000 to 8,000 years ago and has

since been cultivated along the North-South axis, adapting to more temperate climates (Troyer

1999). Following European colonization of the United States, there were two main groups of

open pollinated corn. Dent corn was grown in the south and had high yields due to breeding

efforts. Flint corn was grown in the north, was adapted to shorter growing seasons at northern

latitudes and had lower yields than dent corn. Breeders have developed higher yielding varieties

by incorporating traits such as early flowering and cold tolerance from flint corn into higher

yielding dent corn germplasm.

In the United States in 1900, there were an estimated 1,000 open pollinated cultivars that

had been generated from the crossing of Flint x Dent corn groups (Montgomery 1916; Troyer

1999). In open pollinated populations, the crosses were not controlled, but rather the seed

produced from desirable female plants was selected. The concept of hybrid maize was developed

by Shull (1908, 1909) and E.M. East (1909), with the development of inbred lines and hybrids

beginning in the 1920s. Hybrid maize had the advantage of both male and female parent

selection as well as the phenomenon of heterosis, in which the progeny of two unrelated parents

shows superior performance over the parents for biological traits such as stress tolerance and

yield. The initial inbred lines were derived from the self-pollination of virtually all of the 1,000

open pollinated varieties (Troyer 1999; Duvick 2005).

The hybrids developed in the 1920s were double cross hybrids, produced by crossing four

inbred parents, because single cross hybrids did not produce enough seed (Jones 1918). By the

1930s double-cross hybrid performance had exceeded that of open pollinated cultivars and


double-cross hybrids were widely sold in place of open pollinated populations (Crow 1998).

Over the next 30 years, breeding efforts resulted in increased grain yield of inbred lines, which

increased the yields of single cross hybrids (Troyer 1999). The commercialization of single cross

hybrids in the 1960s replaced double cross hybrids (Crow 1998).

2.1.2 Privatization of U.S. maize breeding

Since the 1970s, breeding has become increasingly privatized and the role of public inbreds in commercial hybrids has continually declined (Darrah and Zuber 1986; Mikel and Dudley 2006). Private companies began their breeding programs in the 1920s and 1930s by using the same public germplasm, and subsequently developed new inbred lines by self-pollinating superior commercial hybrids from their private breeding programs (Troyer 1999). Over time, germplasm has been exchanged between breeding programs through the self-pollination of competitors’ commercial hybrids (Troyer and Mikel 2010). Maize hybrids in the U.S. are currently produced from proprietary inbred lines protected by the U.S. Plant Variety Protection Act (PVP), which was passed in 1970 (Mikel and Dudley 2006; Nelson et al. 2008), and by plant patents. The act excludes others from using the protected inbred for 18 years, or 20 years if registered after 1994. The practice of deriving inbred lines from competitors’ commercial hybrids stopped with the patenting of hybrids. Lines with expired PVP (ex-PVP) can be used in both private and public breeding programs, and for research purposes to gain insight into the heterotic patterns used in proprietary germplasm.

2.1.3 The role of heterotic patterns in maize breeding

The breeding of field corn relies on heterotic patterns, which make heterosis predictable when lines from different heterotic patterns are crossed. These heterotic patterns are typically maintained by keeping the populations of inbred lines genetically distinct from each other (Bernardo 2001). Hybrids derived from lines from different heterotic patterns generally perform better than hybrids created from lines from the same heterotic pattern. The molecular basis of heterosis is not yet understood. The two main theories are “dominance”, in which complementation between parental alleles masks deleterious alleles and accumulates favourable dominant alleles, and “over dominance”, in which heterozygous combinations of alleles are superior to homozygous combinations (Birchler et al. 2003). While neither of these theories can fully explain this phenomenon, heterosis has been utilized in maize breeding for nearly a century.
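The difference between the two theories can be made concrete with a small numerical sketch. It assumes the textbook single-locus scale in which the two homozygotes have values +a and -a and the heterozygote has value d, with locus values simply summed across loci. The function names, the parameter values and the two-locus example are hypothetical and are not taken from this thesis.

```python
def genotypic_value(n_favourable, a, d):
    """Value of one locus given the count of favourable alleles (0, 1 or 2)."""
    return {0: -a, 1: d, 2: a}[n_favourable]


def line_value(genotype, a, d):
    """Sum of locus values, assuming no interaction between loci."""
    return sum(genotypic_value(g, a, d) for g in genotype)


def f1_genotype(parent1, parent2):
    """F1 of two fully inbred parents: each parent passes on one allele per locus."""
    if any(g not in (0, 2) for g in list(parent1) + list(parent2)):
        raise ValueError("parents must be homozygous (0 or 2) at every locus")
    return [(g1 + g2) // 2 for g1, g2 in zip(parent1, parent2)]


def heterosis(parent1, parent2, a, d):
    """Mid-parent and best-parent heterosis of the F1."""
    v1, v2 = line_value(parent1, a, d), line_value(parent2, a, d)
    f1 = line_value(f1_genotype(parent1, parent2), a, d)
    return {"mid_parent": f1 - (v1 + v2) / 2, "best_parent": f1 - max(v1, v2)}


# Dominance (d = a) at one locus: the F1 only matches the better parent.
single_locus = heterosis([2], [0], a=1.0, d=1.0)

# Dominance with complementation: each parent carries the favourable allele
# at a different locus, so the F1 exceeds both parents.
complementation = heterosis([2, 0], [0, 2], a=1.0, d=1.0)

# Overdominance (d > a): the heterozygote beats both homozygotes at one locus.
overdominance = heterosis([2], [0], a=1.0, d=1.5)

print(single_locus, complementation, overdominance)
```

Under these assumptions, dominance produces an F1 better than both parents only when the parents complement each other across loci, whereas overdominance does so at a single locus.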

New inbred lines are usually generated within a heterotic group by recycling elite lines,

which maintains genetic dissimilarity between the heterotic groups but also limits diversity

within the heterotic group (Mikel 2008). To increase genetic diversity in a breeding program,

breeders occasionally use commercial hybrids to generate new inbreds, which utilizes an inter-

heterotic cross and disrupts the heterotic patterns (Bernardo 2001). The success of developing

inbreds from an inter-heterotic cross depends on the selection of a suitable tester (Bernardo

2001). The effects of using an inter-heterotic cross for inbred development are not well

understood.

2.1.4 U.S. heterotic patterns and founding inbred lines

Modern heterotic patterns of North American germplasm can be difficult to discern,

considering the privatization of modern hybrid seed production. However, most elite inbred lines

in a heterotic group can be traced back to a few key public inbred lines, so heterotic patterns can

be described based on these key founder individuals (Darrah and Zuber 1986; Nelson et al.

2008).

U.S. maize heterotic patterns fit into two overarching groups, reflecting the major

division between heterotic patterns: Iowa Stiff Stalk Synthetic (Stiff Stalk) and non-Stiff Stalk

(Duvick 2005; Mikel and Dudley 2006; Mikel 2008). Stiff Stalk lines have been largely

influenced by the public inbred B73, and can also be traced back to founder lines B14 and B37

(Mikel and Dudley 2006; Mikel 2008).

The groups detected within non-Stiff Stalk germplasm depend largely on the material

analyzed. Mikel and Dudley (2006) used U.S. PVP records from 1980 to 2004 to group 685

proprietary lines into seven main groups: Stiff Stalk, Oh43, Lancaster, Oh07-Midland, Iodent,

Commercial hybrid derived, and Argentine Maiz Amargo. Mikel (2008) used PVP records for 55

lines and identified the following non-Stiff Stalk groups: Oh43/miscellaneous, Mo17/LH123,

LH82, Oh07/Midland Yellow Dent/Pioneer Female Composite and Iodent. The Iodent group can

be traced to the founder proprietary line PH207 (Troyer 1999; Mikel and Dudley 2006). The

Lancaster heterotic pattern can be traced back to the public line Mo17 and the private line LH51,

which was derived from Mo17 (Mikel and Dudley 2006). Another non-Stiff Stalk pattern is Wf9,

“Wilson Farm row 9”, which was a popular inbred line for hybrid development for 30 years after

its development in 1936 (Troyer et al. 1999).

While heterotic patterns can be divided into Stiff Stalk and non-Stiff Stalk groups,

heterotic patterns have also been described as three main groups. Mikel and Dudley (2006)

report that the main groups in the 1950s U.S. germplasm were Stiff Stalk, non-Stiff Stalk and

Iodent. Darrah and Zuber (1986) outline the main groups in 1984 germplasm as Lancaster

(Mo17), Reid (B73, B37) and Iodent (I205 and private lines). The division of heterotic groups

and the assigning of lines to heterotic groups can be subjective and is largely dependent on the

germplasm analyzed.

2.2 USE OF GENETIC MARKERS TO INVESTIGATE RELATIONSHIPS OF MAIZE

INBREDS

2.2.1 SNP data for genetic analysis of maize relationships

While pedigree information is commonly used to determine population structure,

pedigree data is often not available or may be limited. Instead, molecular markers can be used to

analyze population structure and the relationships between inbred lines (Nelson et al. 2008).

Several types of markers have been used for genetic analysis of maize populations, including

restriction fragment length polymorphisms (RFLP) (Mumm et al. 1994), simple sequence repeats

(SSR) (Barata and Carena 2006), and single nucleotide polymorphisms (SNPs) (Nelson et al.

2011). SSR markers are multi-allelic and therefore more informative per locus than bi-allelic SNPs.

Nelson et al. (2011) demonstrated that SNP and SSR markers can generate consistent clustering

of inbred lines when using a sufficient number of SNPs, in their case 305 SNPs compared to 150

SSR markers.

A procedure for high density SNP discovery, called Genotyping-by-Sequencing (GBS),

has been described by Elshire et al. (2011). This approach is feasible for high diversity, large

genome species such as maize, and can assess a large number of markers at low cost. This

method uses restriction enzyme digestion to reduce genome complexity by targeting lower copy

regions of the genome and avoiding repetitive regions. In this protocol, DNA is digested with the

restriction enzyme ApeKI, ligated to adapters, amplified, and then sequenced with Next-

Generation-Sequencing technology. The reads can then be aligned to the maize B73 reference

genome. Data from GBS has recently been used for genome-wide association analysis (GWAS)

and the investigation of population structure in maize (Romay et al. 2013).

2.2.2 Hierarchical clustering with maize inbreds

A common method of population structure analysis is hierarchical clustering, which has

been performed using several types of markers, beginning with RFLPs (Mumm et al. 1994), then

SSR (Barata and Carena 2006), and SNPs (Nelson et al. 2011; Wu et al. 2016). Within U.S.

germplasm, clustering with SNP data separates lines according to their heterotic pattern.

Clustering of 114 lines, including ex-PVP and public inbreds, using 639 SNP markers and the

unweighted pair group method with arithmetic mean (UPGMA) resulted in six clusters,

with the main division in the dendrogram separating the Stiff Stalk groups (B73, B37, A632)

from non-Stiff Stalk groups (Mo17, Oh43, PH207) (Nelson et al. 2008). The six predominant

clusters are described by key founder lines in each cluster. Clustering using pair-wise distances

between 21 public inbreds, including exotic material, using 351,710 SNPs resulted in three

clusters: exotic, Stiff Stalk and non-Stiff Stalk (Hansey et al. 2012).

Hierarchical clustering with SNP data has been shown to separate U.S. field germplasm

from other material. Clustering with the UPGMA method using 511 SNP markers separated 689

inbreds into clusters of sweet corn, popcorn and tropical germplasm that were distinct from the

U.S. germplasm clusters (Hansey et al. 2011). However, this clustering analysis did not produce

clear divisions of heterotic patterns within U.S. field germplasm, suggesting that generating

clusters of heterotic patterns may be easier with smaller sample sizes or may be dependent on the

U.S. germplasm used in the analysis.
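
The UPGMA clustering described in this section can be sketched as follows. This is a minimal illustration and not the procedure of any cited study: it assumes SNP genotypes coded as allele counts (0, 1 or 2), uses one minus the identity-by-state proportion as the distance, and relies on SciPy's average linkage, which corresponds to UPGMA. The function names and the toy genotypes are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def ibs_distance(genotypes):
    """Pairwise distance (1 - identity-by-state) between lines.

    genotypes: (n_lines, n_snps) array of allele counts coded 0, 1 or 2.
    """
    g = np.asarray(genotypes, dtype=float)
    # Per-SNP dissimilarity is |g_i - g_j| / 2, averaged over all SNPs.
    diff = np.abs(g[:, None, :] - g[None, :, :]) / 2.0
    return diff.mean(axis=2)


def upgma_clusters(genotypes, n_clusters):
    """Cluster lines with UPGMA (average linkage) and cut into n_clusters."""
    dist = ibs_distance(genotypes)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")  # "average" is UPGMA
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data (made up): two groups of three inbred lines scored at eight SNPs.
toy = np.array([
    [0, 0, 2, 2, 0, 0, 2, 2],
    [0, 0, 2, 2, 0, 0, 2, 0],
    [0, 0, 2, 2, 0, 2, 2, 2],
    [2, 2, 0, 0, 2, 2, 0, 0],
    [2, 2, 0, 0, 2, 2, 0, 2],
    [2, 2, 0, 0, 2, 0, 0, 0],
])
labels = upgma_clusters(toy, n_clusters=2)
print(labels)
```

With real data, the number of clusters at which the tree is cut is a choice made by the analyst, which is one reason the groups reported in the literature depend on the germplasm analyzed.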

2.2.3 Network diagram to group related individuals

A new approach to grouping individuals, called the network diagram, has been recently

described by Romay et al. (2013). In this method, an Identity-by-State (IBS) similarity matrix

was created for 212 Ex-PVP lines using 620,279 SNPs with the software application PLINK

v1.07 (Purcell et al. 2007). Using IBS values greater than 0.9 from the similarity matrix, an algorithm

in the software Gephi dispersed the lines in 2-dimensional space based on the simultaneous

attraction of similar points and the repulsion of dissimilar points (Bastian et al. 2009). This

method does not place any restrictions on the number of clusters that are generated in the

diagram. The network diagram created by Romay et al. (2013) generated three main clusters of

germplasm, Stiff Stalk (B73), non-Stiff Stalk (Mo17/Oh43), and Iodent (PH207), which reflect

the heterotic patterns of the analyzed germplasm.
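
The thresholding step of this method can be illustrated with a short sketch. It is not the PLINK or Gephi implementation: it assumes genotypes coded as allele counts (0, 1 or 2), computes a simple IBS similarity, keeps pairs with similarity above 0.9 as edges, and reports connected groups instead of computing a force-directed layout. All function names, line names and genotypes are hypothetical.

```python
import numpy as np


def ibs_similarity(genotypes):
    """IBS similarity matrix from an (n_lines, n_snps) array coded 0, 1 or 2."""
    g = np.asarray(genotypes, dtype=float)
    diff = np.abs(g[:, None, :] - g[None, :, :]) / 2.0
    return 1.0 - diff.mean(axis=2)


def network_edges(similarity, names, threshold=0.9):
    """Keep only pairs of lines whose similarity exceeds the threshold."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i, j] > threshold:
                edges.append((names[i], names[j], float(similarity[i, j])))
    return edges


def connected_groups(names, edges):
    """Group lines linked directly or indirectly by retained edges."""
    neighbours = {name: set() for name in names}
    for a, b, _ in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for name in names:
        if name in seen:
            continue
        stack, group = [name], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Toy data (made up): five lines scored at 20 SNPs.
names = ["A1", "A2", "A3", "B1", "B2"]
toy = np.zeros((5, 20), dtype=int)
toy[1, 0] = 2   # A2 differs from A1 at one SNP
toy[2, 1] = 2   # A3 differs from A1 at another SNP
toy[3:, :] = 2  # B lines carry the alternate allele everywhere
toy[4, 2] = 0   # B2 differs from B1 at one SNP
similarity = ibs_similarity(toy)
groups = connected_groups(names, network_edges(similarity, names))
print(groups)
```

The edge list produced by `network_edges` is the kind of input a graph layout tool would take; the layout itself is outside the scope of this sketch.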

2.2.4 Identifying parental genomic blocks in maize populations

In breeding programs, breeders aim to select, in the progeny, the genomic segments

underlying the desirable traits of the parents. High density genotypic markers are ideal for

identifying signatures of selection, genomic regions that have been subjected to selection

pressure and likely contain genes underlying biologically important traits (Cadzow et al. 2014).

When a genomic region in two individuals contains alleles that were inherited from a

common ancestor, this shared genomic segment is said to be identical-by-descent (IBD). High

density SNP data has been used to identify IBD regions in small populations of descendants from

three founder lines in Chinese maize breeding: Dan340, Huangzao4 and Mo17 (Liu et al. 2015).

The population sizes were 13, 20 and 23 inbreds for Dan340, Huangzao4 and Mo17, respectively.

In each population, the lines were genotyped for 40k SNPs, and genomic regions common

among all descendants, that also matched the founder parent, were identified as IBD regions,

using a minimum block size of 2 SNPs. Over a thousand IBD regions were detected in each

population: 1,262 in Dan340, 1,373 in Mo17, and 1,019 in Huangzao4. The maximum length of

IBD segment detected was 4.4 Mbp and contained 25 SNPs. These identified IBD regions may

be due to selection pressure, or may be due to genetic drift in the breeding population.
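
The block-calling rule described above (regions shared by all descendants that also match the founder, with a minimum of 2 SNPs) can be sketched as below. This is a simplified illustration and not the code of Liu et al. (2015): it assumes the SNPs belong to one chromosome and are sorted by position, treats a match as identical genotype codes, and ignores missing data. The function name and genotypes are hypothetical.

```python
import numpy as np


def shared_founder_blocks(founder, descendants, min_snps=2):
    """Return (start, end) index pairs of runs of consecutive SNPs at which
    every descendant carries the founder genotype. Indices are inclusive."""
    founder = np.asarray(founder)
    descendants = np.asarray(descendants)
    # True where all descendants match the founder at that SNP.
    conserved = np.all(descendants == founder, axis=0)
    blocks, start = [], None
    for i, flag in enumerate(conserved):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_snps:
                blocks.append((start, i - 1))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and len(conserved) - start >= min_snps:
        blocks.append((start, len(conserved) - 1))
    return blocks


# Toy data (made up): one founder and three descendants at eight SNPs.
founder = [0, 0, 2, 2, 0, 2, 2, 2]
descendants = [
    [0, 0, 2, 0, 0, 2, 2, 2],
    [0, 0, 2, 2, 2, 2, 2, 2],
    [0, 0, 2, 2, 0, 0, 2, 2],
]
blocks = shared_founder_blocks(founder, descendants)
print(blocks)
```

Converting the index pairs to physical lengths would require the SNP positions, which are omitted here.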

2.3 GRAIN YIELD IMPROVEMENTS IN MAIZE

2.3.1 Yield improvements in the 20th century

One of the main targets of maize breeding programs has been increased grain yield,

which is affected by both biomass accumulation as well as the partitioning of above ground

biomass to the grain (Lee and Tollenaar 2007). Selection for grain yield in U.S. commercial

populations has led to a six-fold increase in yield from 1939 (1,300 kg/ha) to 2005 (7,800 kg/ha)

as well as enhanced abiotic stress tolerance (Lee and Tollenaar 2007). While the development of

double-cross and single-cross hybrids, introduced in the 1930s and 1960s, respectively, has

contributed to increased corn yields, the extent of heterosis has not changed over the years,

suggesting that yield improvement since the development of hybrids can be attributed to the

improvement of the inbred lines (Troyer 1999; Duvick 2001).

The improvement of corn yields since the 1930s has been due to both plant breeding and

improved management, with 40-50% of gains attributed to management (Duvick 2005).

Management changes include: farming machinery, nitrogen fertilizer, herbicides, pest control

and higher density planting (Troyer 1999; Duvick 2005). Improvements to farming practices are

plateauing and future increases may be more dependent on genetic gain than changes in

management (Duvick 2005). The remaining 50-60% of yield gains is attributed to genetic

improvements, including improved efficiency of grain production (such as reduced branching

and weight of tassels and more upright leaves) and improved tolerance to biotic and abiotic stress

(such as heat and drought tolerance) (Duvick 2005). Breeding has also improved tolerance to

weed interference, dense planting, low soil nitrogen, low soil moisture, disease, and drought, as

well as adaptation to wider climatic regions (Troyer 1999; Duvick 2001; Tollenaar and Lee

2002). Most of the yield improvement is due to selection of genotypes that are responsive to new

management practices (Tollenaar and Lee 2002). Specifically, maize has been bred for

responsiveness to high nitrogen fertilizer applications and tolerance to high density planting

(Troyer 1999).

2.3.2 Role of planting density in yield increases

Increases in grain yield per hectare in modern maize are not due to more grain per plant

but are instead due to increased tolerance of hybrids to higher plant densities, allowing more

plants per hectare (Duvick 2005; Guo et al. 2011). The density of maize planting has increased

continually since the 1950s (Duvick 2005). In the 1930s, plant density averaged 30,000 plants/ha

and this rose to 40,000 plants/ha in the 1960s, 60,000 plants/ha in the 1980s, and 80,000

plants/ha in the mid-2000s (Duvick 2005). Abiotic stress is heightened in high density

environments because resources such as solar radiation, soil nutrients and soil moisture are

limiting (Tollenaar and Lee 2002). Tolerance to high density planting can be attributed to

increased abiotic stress tolerance, since the plants are subject to chronic abiotic stress in high

density environments (Duvick 2005).

2.3.3 Traditional QTL mapping for grain yield in maize

Yield improvement efforts now incorporate genotypic data as an essential component of

plant breeding (Bernardo and Yu 2007). With the decreasing costs of markers, genotypic data

can now play a large role in breeding through marker assisted selection (MAS) and breeding

value prediction using genome-wide selection. Marker data is especially useful for traits with

low heritability such as maize yield (Moose and Mumm 2008). Quantitative trait loci (QTLs) are

regions of the genome linked to or containing some of the genes underlying a quantitative trait,

which is influenced by many genes and environmental factors.

Traditional QTL mapping for grain yield has been performed using linkage mapping,

which utilizes a bi-parental cross to generate a mapping population. The recombination events

that occur from the bi-parental cross and subsequent inbreeding of the F2 lines create linkage

disequilibrium between the markers and QTLs in this mapping population, facilitating QTL

detection. Previous studies have identified maize grain yield QTLs using linkage mapping with

populations composed of: F3 derived from segregating F2 population (Ribaut et al. 1997), F2:3

families derived from F2 individuals (Malosetti et al. 2008), and F2:3 and F6:7 recombinant

inbred lines (RILs) from a bi-parental cross (Austin and Lee 1996). Grain yield has also been

mapped using testcross populations from backcrossed families crossed with a tester (Ho et al.

2002), F3 lines derived from segregating F2 crossed to two testers (Melchinger et al. 1998), and

F5 lines derived from F2 crossed to a tester (Boer et al. 2007). These identified QTLs provide

insight into the genetic basis of maize yield and the loci underlying yield differences between

maize lines.

2.4 IN SILICO ASSOCIATION MAPPING WITH MAIZE BREEDING PROGRAM

DATA

2.4.1 Advantages of in silico association mapping

An alternate approach to linkage mapping is association mapping, also referred to as

linkage disequilibrium mapping or genome wide association study (GWAS). This approach

utilizes historic linkage disequilibrium, arising from historical recombination events, in a

population of related individuals (Zhu et al. 2008). The use of existing phenotypic and genetic

data to detect QTLs is referred to as in silico mapping (Grupe et al. 2001). This approach uses

computational methods to identify QTLs by using the genetic variation and linkage

disequilibrium in the germplasm analyzed. This linkage disequilibrium is present in the entire

population analyzed, rather than only being present in an experimentally generated population.

In silico association mapping can be performed using elite germplasm, which integrates

QTL discovery with crop breeding. This has a number of advantages over traditional linkage

mapping: (1) As the set of materials that the QTLs are detected in is elite germplasm, the

information can be used directly for marker assisted selection in a breeding program (Parisseaux

and Bernardo 2004; Crepieux et al. 2005); (2) Phenotypes can be generated with environmental

replicates, reducing environmental effects (Zhang et al. 2005); (3) Using existing phenotypic

data has reduced cost compared to generating and assessing phenotypes for a large mapping

population (Parisseaux and Bernardo 2004); and (4) A dataset of elite germplasm has greater

potential for QTL discovery because the parents of the bi-parental cross are likely to be

monomorphic for some markers and will have limited allelic diversity compared to the

population as a whole (Parisseaux and Bernardo 2004; van Eeuwijk et al. 2010; Bink et al.

2012).

2.4.2 In silico QTL mapping with maize breeding program data

This in silico mapping approach can either be applied to an association mapping panel or

to hybrid data from a maize breeding program. An association mapping panel is a collection of

elite inbred maize lines that are genotyped and phenotyped to use in GWAS. In silico QTL

mapping using inbred lines from maize breeding programs has been demonstrated by Romay et

al. (2013) and Zhang et al. (2005) for traits related to flowering time. Alternatively, mapping can

use phenotypic data for maize hybrids, generated in a breeding program, and genotypic data for

parental inbred lines. This approach detects QTLs in the parental inbred lines, with QTLs

detected in each heterotic pattern separately. This approach, as compared to an association

mapping panel, has several advantages: (1) The use of hybrid data can capture the heterotic

phenotype of traits such as grain yield and plant height, which cannot be achieved using

phenotypic data of parental lines; (2) This approach allows for detection of QTLs unique to a

heterotic pattern, as well as allows for cross-validation of QTLs between heterotic patterns

(Parisseaux and Bernardo 2004); and (3) Hybrid trials in breeding programs are conducted in

environments to which the hybrid is adapted, ensuring that detected QTLs are relevant to

environments used for crop production.

Previous studies have reported in silico QTL mapping using proprietary maize breeding

program hybrid data. Parisseaux and Bernardo (2004) detected QTLs for grain moisture, plant

height and smut resistance using 96 SSR markers and 22,774 hybrids from the Limagrain

genetics program (France). The hybrids were generated from 1,266 inbred lines using nine

combinations of the nine heterotic groups. van Eeuwijk et al. (2010) detected QTLs for ear

height, plant height and yield using 769 SNPs and hybrid phenotypic data from Pioneer Hi-bred

International (Johnston, IA). The germplasm included 1,700 hybrids generated by crossing lines

from two heterotic groups. These studies generated hybrids by crossing individuals from

different heterotic pools in various combinations, rather than systematically generating hybrids.

2.4.3 Mixed models for in silico mapping

In silico mapping utilizes a mixed model that is fit to the data and then used to identify

loci significantly associated with the phenotype of interest. The general mixed model is as

follows: phenotypic observations = fixed effects (for environment and replicates) + additive

effects associated with marker alleles in heterotic group 1 parents + additive effects associated

with marker alleles in heterotic group 2 parents + additive effects not associated with markers of

heterotic group 1 parents (polygenic effects) + polygenic effects for heterotic group 2 parents +

residual effects (error). The two papers describing in silico mapping with hybrid phenotypic data

used different mapping approaches: association mapping (Parisseaux and Bernardo 2004) and

interval mapping (van Eeuwijk et al. 2010). The association mapping approach tests each marker

for linkage disequilibrium with a QTL. In the mixed model, the markers were treated as fixed

effects and the model parameters were estimated using the Restricted Maximum Likelihood

(REML) approach. The differences of marker allele effects were computed and tested for

significance using z-tests, for each of the heterotic patterns in the nine crosses used, for a total of

18 opportunities to detect a QTL. A marker locus was considered significant if it had at minimum

one significant z-test.
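
The general mixed model described in words above can be written compactly in matrix notation. The symbols are illustrative assumptions introduced here and are not taken from the cited papers: y is the vector of hybrid phenotypic observations; X, M_k and Z_k are incidence (design) matrices; beta holds the fixed effects for environment and replicates; alpha_k holds the additive effects associated with marker alleles in heterotic group k parents; u_k holds the polygenic effects of heterotic group k parents; and e is the residual. The second expression shows the usual form of a z-test on the difference between two marker allele effects, with a and b as hypothetical allele labels.

```latex
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta}
  + \mathbf{M}_1\boldsymbol{\alpha}_1 + \mathbf{M}_2\boldsymbol{\alpha}_2
  + \mathbf{Z}_1\mathbf{u}_1 + \mathbf{Z}_2\mathbf{u}_2
  + \mathbf{e}
\]

\[
z = \frac{\hat{\alpha}_{a} - \hat{\alpha}_{b}}
         {\operatorname{SE}\!\left(\hat{\alpha}_{a} - \hat{\alpha}_{b}\right)}
\]
```

In the association mapping study described above, such a test was carried out for each of the two heterotic patterns in each of the nine crosses, which gives the 9 x 2 = 18 opportunities to detect a QTL at a marker.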

The interval mapping approach, first described by animal breeders (George et al. 2000),

was used by van Eeuwijk et al. (2010). This approach tests for QTLs at positions between

markers, at chosen intervals along the genome. For each interval, a genomic relationship matrix

was calculated using marker and pedigree data. In the mixed model, the marker effect was

treated as random and the variance/covariance structure was defined by the genomic relationship

matrix. The effects of the parameters in the model were then estimated using REML. For each

interval position, the likelihood of the model with a QTL effect was then compared to the

likelihood of a model with no QTL effect using the Likelihood Ratio Test statistic.
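
The likelihood ratio comparison at each interval position can be sketched as follows. This is a minimal illustration and not the procedure of van Eeuwijk et al. (2010): it assumes the two log-likelihoods have already been obtained from REML fits of the nested models, and it uses a plain chi-square reference distribution. Because a variance component is tested at the boundary of its parameter space, a mixture of chi-square distributions is often used instead, so the p-value below should be read as conservative. The function name and the log-likelihood values are hypothetical.

```python
from scipy.stats import chi2


def likelihood_ratio_test(loglik_null, loglik_qtl, df=1):
    """Return (LRT statistic, p-value) for a model with a QTL effect
    compared to the nested model without it."""
    statistic = 2.0 * (loglik_qtl - loglik_null)
    # Guard against tiny negative values caused by numerical error.
    statistic = max(statistic, 0.0)
    p_value = chi2.sf(statistic, df)
    return statistic, p_value


# Made-up log-likelihoods for one interval position.
statistic, p_value = likelihood_ratio_test(loglik_null=-1050.0, loglik_qtl=-1046.0)
print(statistic, p_value)
```

Repeating this calculation at each chosen position along the genome gives a profile of the test statistic from which candidate QTL positions can be read.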

Association mapping is simpler and less computationally demanding than interval

mapping (Tanksley 1993). However, interval mapping is superior to association mapping for

estimating QTL locations and effects when there is low marker coverage of the genome, since

association mapping requires linkage disequilibrium between the potential QTL and a marker

(Tanksley 1993). The power of the association mapping approach has been assessed using

simulations of maize hybrid data (Yu et al. 2005). This research discovered that higher power for

QTL detection is achieved with a large sample size (2,400 hybrids vs 600), high marker density

(400 vs 200 markers), high heritability (0.7 vs 0.4) and a small number of QTLs underlying the

trait (20 vs 80). With dense marker coverage (markers < 15 cM apart), results are comparable

between interval and association mapping (Stuber et al. 1992; Tanksley 1993; Zhu et al. 2008).

With GBS data, the number of markers that can be generated is much larger than the number

used in these previous studies (Elshire et al. 2011). The dense marker coverage of GBS data has

already been used for in silico GWAS mapping of flowering time in maize (Romay et al. 2013).

This dense marker data has not yet been applied to hybrid data from a maize breeding program.

2.5 QTL MAPPING WITH NORTH CAROLINA DESIGN II DATA

2.5.1 Description of North Carolina Design II

North Carolina Design II (NCII) is a mating design commonly used in maize breeding,

and can be used with any crop making use of heterosis (Comstock and Robinson 1952). This

mating design is used to assess superior parental inbreds through general combining ability and

superior parental combinations through specific combining ability. The NCII, also called a

factorial mating design, was originally developed for livestock, but it has become routinely used

in maize. The NCII is systematic: each member of a group of females is crossed with

each member of another group of males. When used in maize, the group of females belongs to one

heterotic pattern and the group of males belongs to another heterotic pattern. This design can be

used to partition the genotypic variance into effects due to the males, females and their

interaction using a two-way ANOVA (Hallauer et al. 2010a). The genetic effects due to males

and females represent additive genetic effects, while the genetic effects due to the interaction

represent non-additive genetic effects (Rojas and Sprague 1952).
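
The partitioning described above can be illustrated with a small sketch. It is a simplified example and not the analysis used in this thesis: it assumes a balanced females x males table with one mean per hybrid, so the female x male interaction cannot be separated from error, and it omits environments and replicates. The function names and yield values are hypothetical.

```python
import numpy as np


def ncii_effects(cell_means):
    """Decompose a females x males table of hybrid means.

    Returns the grand mean, female and male main effects (general
    combining ability estimates) and the female x male interaction
    (specific combining ability estimates).
    """
    y = np.asarray(cell_means, dtype=float)
    grand = y.mean()
    gca_female = y.mean(axis=1) - grand
    gca_male = y.mean(axis=0) - grand
    sca = y - grand - gca_female[:, None] - gca_male[None, :]
    return grand, gca_female, gca_male, sca


def ncii_sums_of_squares(cell_means):
    """Sums of squares for females, males and their interaction."""
    y = np.asarray(cell_means, dtype=float)
    n_females, n_males = y.shape
    grand, gca_female, gca_male, sca = ncii_effects(y)
    return {
        "females": n_males * np.sum(gca_female ** 2),
        "males": n_females * np.sum(gca_male ** 2),
        "females_x_males": np.sum(sca ** 2),
        "total": np.sum((y - grand) ** 2),
    }


# Made-up hybrid yields: 3 females (rows) crossed to 2 males (columns).
yields = np.array([
    [10.0, 12.0],
    [11.0, 13.0],
    [9.0, 14.0],
])
grand, gca_female, gca_male, sca = ncii_effects(yields)
ss = ncii_sums_of_squares(yields)
print(ss)
```

In a replicated trial, each sum of squares would be divided by its degrees of freedom and tested against the appropriate error term in the two-way ANOVA.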

2.5.2 QTL mapping with NCII breeding program data

The NCII is an ideal mating design for the integration of QTL mapping with existing

maize hybrid data. This mating design is popular in maize and is an efficient method to assess

whether there are sufficient additive genetic effects in each heterotic group to warrant a QTL

mapping experiment with the data. While mapping with NCII data has not yet been described, an

association mapping study did detect QTLs for seed oil content in rapeseed using a partial NCII

design with 205 SSR markers assessed in the parental lines (Bu et al. 2015). This mixed model

included only marker data, and did not include terms for the relationship between lines or

environmental conditions. Further research is needed to develop methods of integrating QTL

mapping with NCII data from plant breeding programs, utilizing high density markers.

2.6 RESEARCH OBJECTIVES AND HYPOTHESES

This thesis project utilized data from several sources. Firstly, 16 Ex-PVP lines and 126

inbred lines from the Guelph breeding program were assessed at 955,690 SNPs through GBS by

the Genomic Diversity Institute at Cornell. These lines belong to Stiff Stalk, Iodent and

Lancaster heterotic patterns. Secondly, publically available GBS data for 17,280 lines, as well as

the pedigree information for these lines, were obtained from panzea.org. This dataset includes

U.S. public germplasm, U.S. ex-PVP lines, tropical germplasm, popcorn, and sweet corn. These

two datasets were combined into one GBS dataset. Thirdly, the QTL mapping component of this

research utilized phenotypic data for grain yield from a NCII experiment at the University of

Guelph. This NCII crossed 11 Stiff Stalk lines to 10 Iodent lines. These hybrids were assessed

for grain yield at three locations near Guelph over three years. The parental lines used in the

NCII are included in the GBS dataset described above.

Objective 1: To investigate population structure using a network diagram and hierarchical

clustering.

o Hypothesis 1. The network diagram groups lines into three main nodes, reflecting three

main heterotic patterns: Stiff Stalk, Lancaster/non-Stiff Stalk and Iodent.

o Hypothesis 2. Hierarchical clustering separates U.S. Ex-PVP field germplasm from

tropical, popcorn and sweet corn germplasm, but does not generate clearly defined

heterotic groups within field corn.

Objective 2: To identify Ex-PVP relatives of Guelph lines using hierarchical clustering.

o Hypothesis 1. Guelph lines cluster with the Ex-PVP line from the public dataset with

greatest genetic similarity. After removing SNPs with identical alleles between the

Guelph line and the first parent/ relative, the Guelph line then clusters with the second

parent/ next closest relative.

o Hypothesis 2. Of the clustering neighbours, an inbred line will share the greatest

percentage of SNP alleles with a parent, over other relatives.

o Hypothesis 3. If the correct parents are identified, 100% of the alleles in an inbred will be

accounted for by the parents.

Objective 3: To identify parental alleles in a group of siblings derived from an inter-

heterotic cross.

o Hypothesis 1. Progeny lines have an overrepresentation of alleles from the parent with

the same behaviour in crosses (heterotic pattern) as the progeny.

o Hypothesis 2. There are large parental linkage blocks common to all siblings.

Objective 4: To identify QTLs for additive effects for grain yield using a mixed model

approach with phenotypic data from NCII.

o Hypothesis 1. A genomic relationship matrix, generated from GBS data, can be used to

describe the variance/covariance structure of the polygenic term in the model, and

substitute for the traditional pedigree-based matrix.

o Hypothesis 2. Based on the density x additive genetic variance present in Stiff Stalk,

determined from NCII ANOVA results, it is expected that QTLs will be detected in the

Stiff Stalk background.

o Hypothesis 3. Based on the absence of additive genetic variance in Iodent germplasm,

determined from NCII ANOVA results, it is expected that no QTLs will be detected in

the Iodent background.

o Hypothesis 4. QTLs for grain yield are numerous, with small effects.

CHAPTER 3: APPLICATIONS OF GENOTYPING-BY-SEQUENCING DATA IN

MAIZE BREEDING PROGRAMS

3.1 ABSTRACT

As genotyping-by-sequencing (GBS) data becomes more abundant, new applications of

this technology in breeding programs are possible. While GBS data is typically used for genomic

selection and QTL mapping, GBS information can also be used to establish relationships

between lines and examine inheritance of parental linkage blocks. This study presents methods to

utilize GBS data in this manner using publically available data to: (1) examine heterotic patterns

in the germplasm; (2) identify putative parents for lines with unknown parentage; and (3)

examine the genomic structure of lines derived from breeding crosses. Marker data from nearly

1,200 inbred lines, including public, ex-PVP, and University of Guelph germplasm was used to

generate a genetic similarity matrix. Using the genetic similarity matrix, a network diagram and

hierarchical cluster diagram were generated, which reflected the heterotic patterns of this

germplasm. Hierarchical clustering was used to identify putative parents of Guelph lines from a

publically available dataset, as demonstrated with examples of known family structures. In two

sets of siblings derived from breeding crosses, parental linkage blocks were determined and

visualized across the chromosomes. These methods allow breeders to maximize the use of their

marker data by analyzing the relationships between lines in their breeding program.

3.2 INTRODUCTION

Maize breeding is divided into two different activities: inbred line development and

hybrid commercialization (Duvick and Cassman 1999; Lee and Tollenaar 2007). Inbred line

development, particularly in the commercial sector, now routinely uses molecular marker data to

predict the genetic value of plants through what is called genomic selection (GS) (Meuwissen et

al. 2001; Bernardo and Yu 2007; Piepho 2009; Crossa et al. 2010). Recently, genotyping-by-

sequencing (GBS) methodologies have been developed to assess a large number of SNP markers

at low cost per sample in large genome species such as maize (Elshire et al. 2011). Currently,

GBS data for 17,280 maize lines, assessed at 955,690 markers, is publically available at

panzea.org. Many applications of GBS data have been described, including GS (Crossa et al.

2013; Zhang et al. 2015), QTL mapping (Romay et al. 2013; Li et al. 2015a; Li et al. 2015b;

Zhou et al. 2016), prediction of deleterious mutations (Mezmouk and Ross-Ibarra 2014), and

analysis of population structure (Romay et al. 2013; Wu et al. 2016). However, the potential of

this data for understanding the structure of the commercial maize germplasm pool, the

relationships between maize lines, and the inheritance of linkage blocks has not received much

attention.

Maize germplasm is complex, including varieties of popcorn, sweet corn, and field corn,

grown in diverse regions around the world. Over the last 100 years, field corn germplasm in the

U.S. has shifted from open pollinated cultivars to hybrids. In the early 1900s, it is estimated that

U.S. field corn consisted of 1000 open pollinated cultivars (Montgomery 1916). Genetic

improvement was achieved through recurrent selection of ears from female plants with desirable

traits. The concept of hybrid maize was then developed by Shull (1908, 1909) and East (1909),

with the concept of double-cross hybrids contributed by Jones (1918). Double-cross hybrids

began replacing open pollinated cultivars in the 1930s, and these were then replaced by single

cross hybrids, beginning in the 1960s (Crow 1998). The initial inbred lines were derived from

the self-pollination of virtually all of the open pollinated varieties (Troyer 1999). Private

companies began their breeding programs by utilizing the same public germplasm in the 1920s

and 1930s, and subsequently developing new inbred lines by self-pollinating superior,

commercial hybrids from their private breeding programs (Troyer 1999). Over time, germplasm

has been exchanged between breeding programs through the self-pollination of competitors’

commercial hybrids (Troyer and Mikel 2010). Since the 1970s, breeding has become

increasingly privatized and the role of public inbred lines in commercial hybrids has continually

declined (Darrah and Zuber 1986; Mikel and Dudley 2006). The practice of deriving inbred lines

from competitors’ commercial hybrids stopped with the patenting of hybrids. Maize hybrids in

the U.S. are currently produced from proprietary inbred lines protected by the U.S. Plant Variety

Protection Act (PVP), which was passed in 1970 (Mikel and Dudley 2006; Nelson et al. 2008)

and through plant patents. Lines with expired PVP (ex-PVP) can then be used in public breeding

programs and for research purposes to gain insight about proprietary germplasm.

The breeding of field corn utilizes heterotic patterns, which make heterosis predictable

when lines from different heterotic patterns are crossed. These heterotic patterns are

typically maintained by keeping the populations of inbred lines genetically distinct from each

other (Bernardo 2001). New inbred lines are typically generated within a heterotic group by

recycling elite inbred lines, which maintains genetic dissimilarity between the heterotic groups

but also limits diversity within the heterotic group (Mikel 2008). Since most elite inbred lines in

a heterotic group are related to a few founder individuals, heterotic patterns can be described

based on these key founder individuals (Darrah and Zuber 1986; Nelson et al. 2008). The U.S.

maize heterotic patterns fit into two overarching groups, reflecting the major division between

heterotic patterns: Iowa Stiff Stalk Synthetic (Stiff Stalk) and non-Stiff Stalk (Duvick 2005;

Mikel and Dudley 2006; Mikel 2008). However, U.S. germplasm has also been described as

three main groups: Stiff Stalk (with founder lines B73, B14, B37), non-Stiff Stalk/ Lancaster

(Mo17, Oh43), and Iodent (PH207) (Darrah and Zuber 1986; Mikel and Dudley 2006).

Molecular markers can be used to understand the structure of commercial germplasm

pools and the relationships between maize lines. A common method of population structure

analysis is hierarchical clustering, which has been performed using several types of markers,

beginning with RFLPs (Mumm et al. 1994), then SSRs (Barata and Carena 2006), and SNPs

(Nelson et al. 2011; Wu et al. 2016). Hierarchical clustering with SNP data has been used to

separate U.S. field germplasm from tropical, sweet corn and popcorn germplasm (Hansey et al.

2011). Clustering with SNP data also divides ex-PVP and public U.S. field corn germplasm into

two main subgroups, Stiff Stalk and non-Stiff Stalk lines (Nelson et al. 2008; Hansey et al.

2012). A dendrogram of 92 ex-PVP lines, created using 614 SNPs, resulted in six predominant

clusters, described by the key founder line in each cluster: B73, Mo17, PH207, A632, Oh43, and

B37 (Nelson et al. 2008).

While hierarchical clustering will always calculate clusters regardless of the strength of

the connection, there are other methods of analyzing population structure that reflect the degree

of similarity or dissimilarity between individuals. For example, principal component analysis

(PCA) uses a similarity matrix to identify principal components, to describe the variability of the

data. For 92 ex-PVP lines, the first two principal components from PCA created a tetrahedral

cloud with vertices of B73, Mo17 and PH207, reflecting the key founder lines in the germplasm

(Nelson et al. 2008). For 544 tropical lines, PCA indicated three clearly distinguished major

subgroups, which were consistent with the environmental adaptations of the germplasm and

CIMMYT breeding records (Wu et al. 2016). Another method, termed principal coordinate

analysis (PCOA), utilizes a dissimilarity matrix and reflects the distance between pairs of points.

The analysis of 2,815 maize inbred lines by PCOA, using GBS data, revealed groups of known

maize subpopulations, including tropical, sweet corn, popcorn, Stiff Stalk, and non-Stiff Stalk

lines (Romay et al. 2013). A new, recently described approach is the network diagram, which

uses a similarity matrix as input. An algorithm, performed by the software Gephi, disperses the

lines in 2-dimensional space based on the simultaneous attraction of similar points and the

repulsion of dissimilar points (Bastian et al. 2009). This method has been applied to 212 ex-PVP

lines, which generated three main clusters of germplasm, Stiff Stalk, non-Stiff Stalk, and Iodent,

which reflect the heterotic patterns of the analyzed germplasm (Romay et al. 2013).

The application of marker data to investigate the inheritance of linkage blocks in

breeding populations has not received much attention. High density genotypic markers are ideal

for identifying signatures of selection, which are genomic regions that have been subjected to

selection pressure and likely contain genes underlying biologically important traits (Cadzow et

al. 2014). When a genomic region in two individuals contains alleles that were inherited from

a common ancestor, this shared genomic segment is said to be identical-by-descent (IBD). High

density SNP data has been used to identify IBD regions in small populations of descendants from

three founder lines in Chinese maize breeding: Dan340, Huangzao4 and Mo17 (Liu et al. 2015).

In each population, the lines were genotyped for 40k SNPs, and genomic regions common

among all descendants, that also matched the founder parent, were identified as IBD regions,

with over a thousand IBD regions detected in each population. This method has not yet been

applied to temperate maize populations.

This paper examines several uses of GBS data in maize breeding programs. Specifically,

the utility of hierarchical clustering and a network diagram for assigning lines to heterotic

patterns is examined. Then, a strategy for using hierarchical clustering with GBS data to

establish parentage between lines in the absence of well-defined pedigree information is

illustrated. And finally, this work demonstrates how GBS data can be used to examine the

genomic structure of progeny from various breeding crosses.

3.3 METHODS

3.3.1 Germplasm and marker data

The germplasm selected for sequencing included 126 inbred lines from the University of

Guelph maize breeding program, belonging primarily to the Stiff Stalk, Iodent and Lancaster

heterotic patterns, and 16 ex-PVP lines (Table 3.1). Leaf discs from several plants were bulked

and DNA was extracted (University of Wisconsin, Genomes to Field). GBS data was generated

for the 142 inbred lines by the Genomic Diversity Facility at Cornell University using the

method described by Elshire et al. (2011), hereafter referred to as the Guelph GBS data. The Guelph

GBS data was partially imputed and assessed at 955,690 SNPs. Publicly available GBS data for

17,280 inbred lines was also obtained from panzea.org (file AllZeaGBSv2.7, posted December

18, 2013).

Table 3.1. Pedigrees and families of Guelph inbred lines and Ex-PVP lines used to generate GBS

data. Pedigree information from E. Lee (unpublished data) and U.S. plant variety protection

certificates.

Guelph lines | Heterotic Pattern | Family | Pedigree
CG1 | | | Funks G10
CG2 | | | Funks G10
CG3 | | | Funks G10
CG4 | | | Funks G10
CG5 | | B73 | Funks G10
CG6 | | | Open-pollinated flint variety from the sub-alpine area of Europe
CG7 | | | from second cycle lines developed in England for cold tolerance
CG8 | | | Open-pollinated flint variety from the sub-alpine area of Europe
CG9 | | | from second cycle lines developed in England for cold tolerance

CG10 | | | from second cycle lines developed in England for cold tolerance
CG13 | | | Golden Glow
CG14 | | | Wigor
CG15 | | | Wigor
CG16 | Misc | Wf9/Mn13 | (CG7 x CM37)
CG17 | Misc | Wf9/Mn13 | (CG7 x CM37)
CG18 | Misc | Wf9/Mn13 | (CG7 x CM37)
CG19 | Misc | Wf9/Mn13 | (CG7 x CM37)
CG20 | Misc | Wf9/Mn13 | (CG7 x CM37)
CG21 | Misc | Wf9/Mn13 | CH593-9 BC2 NRP GL5
CG22 | Misc | Wf9/Mn13 | CH593-9 BC2 NRP GL5
CG23 | | | CH591-23 BC2 NRP A498
CG24 | Misc | Wf9/Mn13 | (CG19 x CG17) x Pioneer3950
CG25 | | | CH593-9 BC3 NRP A498
CG26 | | | F2 x F7
CG27 | Stiff Stalks | B73 | B73 BC1 NRP early single-cross
CG28 | | | CG Synthetic A (S) C0
CG29 | | | CG Synthetic B (S) C0
CG30 | | | CG Synthetic B (S) C0
CG31 | | | CG HOPE A
CG32 | | | CG HOPE A
CG33 | Misc | Wf9/Mn13 | (CG19 x CG17) x Pioneer3950
CG34 | | | (CG19 x CG17) x Pioneer3950
CG37 | | | Pioneer3803
CG38 | | | Pioneer3803
CG40 | | | Pioneer3902
CG41 | Iodent | | Pioneer3902
CG42 | Stiff Stalks | B73 |
CG44 | Iodent | 207 | Pioneer3790
CG50 | | | France-Canada Gene Pool
CG52 | | | Morden Stiff Stalk Gene Pool
CG53 | Misc | Wf9/Mn13 | Stalk Quality III
CG54 | Misc | Wf9/Mn13 | Stalk Quality VII (Stress pop.)
CG55 | Iodent | | Pioneer3881
CG56 | Iodent | | Pioneer3925

CG57 | Stiff Stalks | unrSS | Pioneer3803
CG58 | Iodent | | Pioneer3906
CG60 | Iodent | 207 | Pioneer3902
CG61 | Iodent | | Pioneer3902
CG62 | | | Pioneer3902
CG63 | Iodent | | Pioneer3790
CG64 | Iodent | | Pioneer3790
CG65 | Stiff Stalks | unrSS | Pioneer3803
CG66 | Iodent | | Pioneer3901
CG67 | Iodent | | Pioneer3901
CG68 | Iodent | | Pioneer3707
CG69 | Misc | Wf9/Mn13 | Pioneer3929
CG70 | Misc | Wf9/Mn13 | Pioneer3929
CG71 | Misc | Wf9/Mn13 | Pioneer3929
CG73 | Iodent | | Stalk Quality VII (Stress pop.)
CG79 | Stiff Stalks | unrSS | Pioneer3803
CG80 | | | Pioneer3737
CG84 | Misc | Wf9/Mn13 | Pioneer3929
CG85 | Iodent | | Pioneer3790
CG86 | Iodent | | Pioneer3790
CG88 | | | CG Lancaster (RRS)
CG89 | Misc | Wf9/Mn13 | S2 line from CG CBI x Pioneer3929
CG90 | | | S2 line from CGSynA x Pioneer3902
CG91 | | | S2 line from CG Hope IA x Pioneer3921
CG92 | | | CG23 x Pioneer3921
CG93 | Misc | Wf9/Mn13 | S5 line from Pioneer3969 x Pioneer3929
CG94 | Misc | Wf9/Mn13 | S2 line from CG Lancaster (Bio) x Pioneer3921
CG95 | Misc | Wf9/Mn13 | S2 line from FCGP x Pioneer3929
CG97 | Iodent | | S2 line from CG Wigor x Pioneer3921
CG98 | Misc | Wf9/Mn13 | S2 line from SQII x Pioneer3929
CG99 | Misc | Wf9/Mn13 | S2 line from CG SS (BIO) x Pioneer3921
CG100 | Misc | Wf9/Mn13 | S2 line from SQI x Pioneer3929
CG101 | | | S2 line from CG Lancaster (BIO) x Pioneer3921
CG102 | Stiff Stalks | B14 | CG Stiff Stalk combined C2
CG104 | Iodent | 207 | Pioneer3902
CG105 | Iodent | 207 | Pioneer3902

CG106 | Misc | Wf9/Mn13 | (CG33 x CG34) x (BSL (S4) C7 x CGSyn A-NL) (S4)
CG108 | Iodent | 207 | Pioneer3902
CG110 | Lancaster | unknown | CCGP A C2S2 x NK2555
CG111 | Lancaster | unknown | CCGP B C2S2 x NK2555
CG112 | | | (CG SynA C7 x Pioneer 3475)
CG113 | | | (CG SynA C7 x Pioneer 3475)
CG114 | | | (CG102 x G-4193)
CG115 | | | CG CBI C3 x Pioneer 3876
CG118 | Stiff Stalks | B14 | (CG65 x CG33) x CG102
CG119 | Stiff Stalks | B14 | (CG65 x CG33) x CG102
CG120 | Stiff Stalks | B14 | (CG65 x CG33) x CG102
CG121 | Stiff Stalks | B14 | CG60 x CG102 Intermated pop.
CG122 | Iodent | 207 | CG60 x CG102 SSD pop.
CG123 | Lancaster | Oh43-Iodent | CGR01 x CG110
CG124 | Lancaster | Oh43 | (CG103/CO422 x B100/CG88) x LH38
CG125 | Iodent | 207 | (CG108 x Caribbean Flint) x CG108
CG126 | Lancaster | unknown | CG111 x (CG111 x Mexican Dent)
CG127 | Iodent | 207 | (CG62 x Mexican Dent) x CG62
CGR01 | Lancaster | unknown | (B93 x NK2555) x NK2555
CGR02 | | | (CG44x(MF4xP1247)-3)-B-2-1-1-1
CGR03 | Lancaster | Oh43-Iodent | (B100 x NK2555) x NK2555
CGX333 | Stiff Stalks | B73 | SD79 x SD80, white
HiC1 | Stiff Stalks | B14 | CH04030 x CG102-2(G)-1-1-1
HiC3 | Stiff Stalks | B14 | UR13088 x CG102-6(G)-1-4-1
HiC4 | Stiff Stalks | B14 | UR13088 x CG102-3(G)-1-3-1
HiC5 | Stiff Stalks | B14 | UR13088 x CG102-3(G)-1-4-1
HiC6 | Stiff Stalks | B14 | UR01089 x CG102-2(G)-1-1-1
HiC8 | Lancaster | unknown | CH05015 x CG33-1(R)-1-3-1
HiC9 | Lancaster | unknown | CH05015 x CG33-1(R)-1-4-1
HiC11 | Iodent | unknown | AR13026 x CG60/CG62-4(G)-1-1-1
HiC12 | Iodent | unknown | AR13026 x CG60/CG62-4(G)-1-3-1
HiC13 | Iodent | unknown | AR13026 x CG60/CG62-1(G)-1-1-1
HiC14 | Iodent | unknown | AR13026 x CG60/CG62-1(G)-1-3-1
HiC17 | Iodent | unknown | AR13026 x CG60/CG62-1(R)-1-2-1
HiC21 | Iodent | unknown | AR13026 x CG60/CG62-5(R)-1-1-1

HiC22 | Iodent | unknown | AR13026 x CG60/CG62-5(R)-1-6-1
HiC23 | Iodent | unknown | AR13026 x CG60/CG62-5(R)-1-11-1
HiC24 | Lancaster | unknown | AR13026 x CG33-3(G)-1-3-1
HiC25 | Lancaster | unknown | AR13026 x CG33-3(G)-1-5-1
HiC26 | Lancaster | unknown | AR13026 x CG33-3(G)-1-6-1
HiC27 | Stiff Stalks | B14 | AR13026 x CG102-18(R)-1-3-1
HiC28 | Stiff Stalks | B14 | AR13026 x CG102-18(R)-1-4-1
HiC29 | Stiff Stalks | B14 | AR13026 x CG102-18(R)-1-8-1
HiC30 | Stiff Stalks | B14 | AR13026 x CG102-22(R)-1-6-1
HiC32 | Stiff Stalks | B14 | AR13026 x CG102-24(R)-1-3-1
HiC33 | Stiff Stalks | B14 | AR13026 x CG102-24(R)-1-5-1

Ex-PVP lines | Heterotic Pattern | Family | Pedigree
(DK)2MCDB | Lancaster | unknown | #1 2MA22 x 4780 composite
(DK)8M129 | Lancaster | BS11 | 78060A x 88144
(AS)5707 | Lancaster | C103 |
LH159 | Lancaster | BS11 | Pioneer3160
LH210 | Lancaster | BS11 | LH51 x BS11LH C3
LH216 | Lancaster | Mo17 | (LH51 x LH123) x LH51
PHBW8 | Stiff Stalks | unrSS | PHJ40 x PHW52
PHEG9 | Stiff Stalks | B84 | PHG86 x PHW52
PHGG7 | Lancaster | Wf9 | PHT64 x PHG49
PHHV4 | Stiff Stalks | B84 | PHG69 x PHM44
PHK56 | Lancaster | Oh7-Midland/Iodent | PHG47 x PHG35
PHKE6 | Iodent | KE6 | PHG29 x PHG47
PHRE1 | Stiff Stalks | unrSS | PHJ40 x PHR47
PHVJ4 | Iodent | VJ4 | PHJ40 x 207
PHW53 | Iodent | 207 | G50 x PHZ51
PHW80 | Lancaster | C103 | PHK76 x PHN37

3.3.2 Network Diagram

A subset of the data consisting of 1,191 lines was used to generate the network diagram.

The selected lines included 110 Guelph inbred lines and 1,081 lines from the Panzea data set,

which were deposited by Romay et al. (2013). The 1,081 lines selected are U.S. ex-PVP and

public field corn from the Stiff Stalk, Iodent and non-Stiff Stalk heterotic patterns. Markers were

filtered using TASSEL (Bradbury et al. 2007), for minor allele frequency (maf) ≥ 0.05 and

maximum 90% missing data. The 353,994 SNPs passing filtering criteria were re-coded to

additive components, indicating the number of minor alleles at each locus (i.e. 0, 1, 2) using

PLINK v1.9 (Purcell et al. 2007). An identity-by-state (IBS) matrix was calculated using the

method of VanRaden (VanRaden 2008). The network diagram was created with the IBS matrix

using the network visualization platform Gephi v.0.8.2 (Bastian et al. 2009). The force-directed

layout Force Atlas was used. The connection between nodes was filtered based on IBS values,

and different levels of this filtering parameter, from IBS > 0.1 to IBS > 0.9, were assessed. The

value of the filtering parameter that generated the best separation of clusters was selected, which

was an IBS value > 0.3.
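The marker filtering, similarity matrix and edge filtering steps described above can be illustrated with a minimal sketch. It is not the TASSEL, PLINK or Gephi workflow itself: the function names are hypothetical, genotypes are assumed to be already coded as minor-allele counts (0, 1, 2) with None for missing calls, the matrix is assumed to follow the first formulation of VanRaden (2008), G = ZZ' / (2 Σ p(1 − p)), and missing calls are assumed to be replaced by the marker mean.

```python
# Genotypes are minor-allele counts (0, 1, 2); None marks a missing call.
# genotypes[line][marker] is the layout assumed throughout this sketch.

def filter_markers(genotypes, min_maf=0.05, max_missing=0.90):
    """Return indices of markers passing the MAF and missing-data thresholds."""
    n_lines = len(genotypes)
    keep = []
    for m in range(len(genotypes[0])):
        calls = [row[m] for row in genotypes if row[m] is not None]
        if not calls:
            continue
        missing = 1 - len(calls) / n_lines
        freq = sum(calls) / (2 * len(calls))
        maf = min(freq, 1 - freq)
        if maf >= min_maf and missing <= max_missing:
            keep.append(m)
    return keep


def vanraden_matrix(genotypes, markers):
    """Similarity matrix G = ZZ' / (2 * sum(p * (1 - p))) over the given markers.

    Assumption of this sketch: a missing call is set to the marker mean (2p),
    so it contributes zero to the centred matrix Z.
    """
    freqs = []
    for m in markers:
        calls = [row[m] for row in genotypes if row[m] is not None]
        freqs.append(sum(calls) / (2 * len(calls)))
    denom = 2 * sum(p * (1 - p) for p in freqs)
    z = [[0.0 if row[m] is None else row[m] - 2 * p
          for m, p in zip(markers, freqs)] for row in genotypes]
    n = len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / denom
             for j in range(n)] for i in range(n)]


def edges_above(matrix, names, threshold=0.3):
    """Edge list of line pairs whose similarity exceeds the filtering threshold."""
    return [(names[i], names[j], matrix[i][j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if matrix[i][j] > threshold]
```

An edge list of this form (source, target, weight) is the kind of input a force-directed layout tool can read; sweeping the threshold from 0.1 to 0.9 would correspond to the filtering levels assessed above.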

3.3.3 Hierarchical clustering

The Panzea public dataset was filtered for lines that had pedigree information, resulting

in 2,554 lines. Then, 142 Guelph lines were added to this dataset, for a total of 2,696 lines. This

dataset contains popcorn, sweet corn, tropical, U.S. public and ex-PVP germplasm, and lines

derived in Guelph. Markers with maf ≥ 0.05 and maximum 90% missing data were retained,

using VCFtools v0.1.12b, with 288,878 SNPs passing filtering (Danecek et al. 2011). SNPs were

re-coded to additive components, using PLINK (Purcell et al. 2007). An IBS matrix was

calculated using the VanRaden subroutine (VanRaden 2008). The IBS matrix was scaled and

centered in R 3.1.3 (R Core Team 2015). A distance matrix was computed in R using the

Euclidean distance. Hierarchical clustering was performed on the distance matrix using the

Ward’s minimum variance method and a dendrogram was created.
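The scaling, distance and clustering steps can be sketched as follows. This is an illustrative pure-Python version, not the R code used in the study: the function names are hypothetical, the scaling is assumed to centre each column and divide by its sample standard deviation, and Ward's criterion is assumed to be applied through the Lance-Williams update on squared Euclidean distances, with merge heights reported as square roots. Which Ward variant was used in R is not stated in the text.

```python
import math
import statistics


def scale_columns(matrix):
    """Centre each column to mean 0 and scale to unit sample standard deviation.

    Assumes no column is constant (a zero standard deviation would fail).
    """
    cols = list(zip(*matrix))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


def squared_euclidean(rows):
    """Matrix of squared Euclidean distances between rows."""
    n = len(rows)
    return [[sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
             for j in range(n)] for i in range(n)]


def ward_linkage(rows):
    """Naive agglomerative clustering with Ward's minimum variance criterion.

    Returns a list of merges (members_a, members_b, height).
    """
    d2 = squared_euclidean(rows)
    clusters = {i: [i] for i in range(len(rows))}
    dist = {(i, j): d2[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(rows)
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest Ward distance.
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        merges.append((sorted(clusters[i]), sorted(clusters[j]), math.sqrt(dij)))
        # Lance-Williams update for the distance to the merged cluster.
        new_dist = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dki = dist[(min(k, i), max(k, i))]
            dkj = dist[(min(k, j), max(k, j))]
            new_dist[(k, next_id)] = (
                (ni + nk) * dki + (nj + nk) * dkj - nk * dij
            ) / (ni + nj + nk)
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges
```

The naive search over all pairs is cubic in the number of lines, so this form is only suitable for small examples; for thousands of lines a library implementation would be needed.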

The clustering method to identify putative parents was identical to the clustering method

described above. Here, clustering was performed using one Guelph line at a time, or one set of

siblings at a time, with 2,570 other lines (2,554 lines from the Panzea dataset plus 16 ex-PVP

lines). For each Guelph line assessed, a new dataset was created, a new IBS matrix was

calculated and the clustering was performed. To assess the similarity between the line of interest

and the neighbours in the clustering output, the percentage of matching SNP alleles was

determined. To do this, a custom script was used that ignored missing loci and counted the

loci where the SNP alleles were identical between a line and the proposed relative. To

cluster a line with the second (or third) parent, SNP locations where the genotype of the line

matched the genotype of the proposed parent #1 were removed from the dataset. With this subset

of SNPs, the IBS matrix was recalculated and the clustering procedure repeated. To evaluate the

similarity between lines and putative parents identified through clustering, the percentage of

alleles unaccounted for by the parents was determined. Using 0 for major allele and 1 for minor

allele, the following scenarios were deemed as “unaccounted for”: (1) offspring was 0/0 and all

parents were 1/1; (2) offspring was 1/1 and all parents were 0/0; and (3) offspring was

heterozygous and all parents were 0/0 or all parents were 1/1. The percentage of alleles

unaccounted for was then calculated by dividing the number of unaccounted for loci by the

number of loci with non-missing data for all the lines. The percentage of unaccounted for alleles

in randomly generated parent and offspring combinations, from the dataset of 2,696 lines, was

determined by using a random number generator to select a line, parent #1 and parent #2 from

the dataset, then calculating the percentage of alleles unaccounted for between the chosen line

and parents, using the method described above. A histogram of the proportion of unaccounted for

alleles for 10,000 randomly generated pedigrees was generated using a spreadsheet.
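The matching and "unaccounted for" calculations can be sketched as below. The custom scripts themselves are not shown in the text, so this is only one possible reading: the function names are hypothetical, each genotype is assumed to be a sorted tuple of two alleles (0 for major, 1 for minor) or None when missing, and the random pedigrees are assumed to be drawn without replacement so that a line is never its own parent.

```python
import random


def percent_matching(line, relative):
    """Percentage of loci with identical genotypes, ignoring missing loci."""
    pairs = [(a, b) for a, b in zip(line, relative)
             if a is not None and b is not None]
    same = sum(1 for a, b in pairs if a == b)
    return 100.0 * same / len(pairs)


def is_unaccounted(offspring, parents):
    """Apply scenarios (1) to (3): the offspring carries an allele no parent has."""
    if offspring == (0, 0):
        return all(p == (1, 1) for p in parents)
    if offspring == (1, 1):
        return all(p == (0, 0) for p in parents)
    # Heterozygous offspring: unaccounted if all parents are fixed for one allele.
    return all(p == (0, 0) for p in parents) or all(p == (1, 1) for p in parents)


def percent_unaccounted(offspring, parents):
    """Percentage of unaccounted loci among loci with no missing data in any line."""
    total = 0
    bad = 0
    for locus, geno in enumerate(offspring):
        parent_genos = [p[locus] for p in parents]
        if geno is None or any(g is None for g in parent_genos):
            continue
        total += 1
        bad += is_unaccounted(geno, parent_genos)
    return 100.0 * bad / total if total else float("nan")


def random_pedigree_null(lines, n_draws=10000, seed=0):
    """Percent unaccounted for randomly drawn (line, parent #1, parent #2) trios.

    lines: dict mapping line name to its list of genotypes.
    """
    rng = random.Random(seed)
    names = list(lines)
    results = []
    for _ in range(n_draws):
        child, p1, p2 = rng.sample(names, 3)
        results.append(percent_unaccounted(lines[child], [lines[p1], lines[p2]]))
    return results
```

The list returned by `random_pedigree_null` holds the values that would be binned into the histogram of randomly generated pedigrees.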

3.3.4 Identity by descent

There were two unique sets of sibling inbred lines. The first set included seven siblings

derived from a Stiff Stalk by Iodent two-way inter-heterotic breeding cross. All seven inbred

lines behaved as Iodent inbred lines when crossed to inbred lines of defined heterotic patterns.

The second set contained three siblings that behaved as Stiff Stalk inbred lines when crossed to

inbred lines of defined heterotic patterns. This set was derived from the following three-way

breeding cross: [CG102 x (CG33 x CG65)]. CG102 and CG65 belong to the Stiff Stalk heterotic

pattern, while CG33 belongs to the Wf9/Mn13 family of the Lancaster heterotic pattern. For

each inbred, SNP locations were filtered for no missing data between the line and the parents,

using VCFtools (Danecek et al. 2011). A custom script then identified SNPs where the offspring

was homozygous and the allele was only present in one of the parents. Ideograms of parental

SNP alleles were created with PhenoGram (Wolfe et al. 2013) using chromosome lengths and

centromere positions obtained from B73 RefGen_v2 from MaizeGDB (Andorf et al. 2010). For

the three-way cross inbred lines, the number of SNP alleles inherited from each parent was tested

for deviation from the expected 2:1:1 ratio using a chi-square test.
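A minimal sketch of the parent-of-origin assignment and the chi-square test follows. The function names are hypothetical and the custom script is not reproduced in the text; genotypes are assumed to be tuples of two alleles, and the test statistic is the standard goodness-of-fit sum of (observed − expected)² / expected.

```python
def assign_parent(offspring, parent_genotypes):
    """Return the name of the only parent carrying the offspring's allele.

    offspring: genotype tuple, e.g. (1, 1).
    parent_genotypes: dict mapping parent name to genotype tuple.
    Returns None for heterozygous offspring or when the allele is present
    in more than one parent (or in none), since the locus is uninformative.
    """
    if offspring[0] != offspring[1]:
        return None
    allele = offspring[0]
    carriers = [name for name, g in parent_genotypes.items() if allele in g]
    return carriers[0] if len(carriers) == 1 else None


def chi_square_ratio(observed, ratio):
    """Chi-square goodness-of-fit statistic against an expected ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For the three-way cross, the observed counts for CG102, CG33 and CG65 would be compared with the ratio (2, 1, 1). With three classes the test has two degrees of freedom, so a statistic above 5.991 indicates a deviation at the 0.05 significance level.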

Parental blocks in each individual were determined using a custom script that identified

regions where a minimum of two consecutive SNP alleles originated from the same parent. The

length of the block was determined by subtracting the position of the beginning SNP from the

position of the ending SNP in the block. The average number and size of the blocks were

calculated using a spreadsheet. A histogram of the sizes of parental linkage blocks for each

inbred line was generated using a spreadsheet.
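The block definition above can be written compactly. The sketch below is an assumed implementation with a hypothetical function name; it expects the parent-assigned SNPs of one chromosome, sorted by position, and treats a change of parent as the end of a block.

```python
from itertools import groupby


def parental_blocks(snps, min_snps=2):
    """Group consecutive SNPs assigned to the same parent into blocks.

    snps: list of (position, parent) for one chromosome, sorted by position.
    Returns (parent, start, end, length) for each run of at least min_snps
    SNPs, where length is the end position minus the start position.
    """
    blocks = []
    for parent, group in groupby(snps, key=lambda s: s[1]):
        positions = [pos for pos, _ in group]
        if len(positions) >= min_snps:
            start, end = positions[0], positions[-1]
            blocks.append((parent, start, end, end - start))
    return blocks
```

A single SNP flanked by SNPs from another parent is not reported as a block under this rule, although it still interrupts the surrounding run.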

To determine IBD regions shared between siblings within a dataset, each dataset was first

filtered for no missing data in any of the lines or parents. A custom script was then used to

identify SNPs that were polymorphic for the parents and had identical alleles between all the

siblings. An ideogram of alleles common to all siblings in the dataset was generated using

PhenoGram (Wolfe et al. 2013). The shared SNPs between the seven siblings of the Stiff Stalk

by Iodent cross were visually inspected to identify regions of dense marker coverage, which

were then indicated on the ideogram.
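The selection of SNPs shared by all siblings can be sketched as follows, assuming genotypes are stored as tuples with None for missing calls. The function name is hypothetical and the sketch stands in for the custom script, which is not shown in the text.

```python
def shared_sibling_snps(siblings, parents):
    """Indices of loci polymorphic between parents and identical in all siblings.

    siblings, parents: lists of genotype lists, all of the same length.
    Loci with a missing call (None) in any line or parent are skipped.
    """
    shared = []
    for locus in range(len(parents[0])):
        parent_calls = [p[locus] for p in parents]
        sib_calls = [s[locus] for s in siblings]
        if any(c is None for c in parent_calls + sib_calls):
            continue
        if len(set(parent_calls)) > 1 and len(set(sib_calls)) == 1:
            shared.append(locus)
    return shared
```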

3.4 RESULTS AND DISCUSSION

3.4.1 Classifying lines into heterotic patterns using GBS data

The number of lines that could be analyzed simultaneously with the network diagram

was limited due to the memory and processing power required to run the software. For this

network diagram experiment, the sample size was limited to approximately 1,200 lines. Because

this experiment aimed to generate nodes reflecting U.S. heterotic patterns, only U.S. field corn

inbred lines and Guelph inbred lines were selected. The selected set of inbred lines included 110

Guelph lines and 1,081 U.S. public and ex-PVP field corn lines from the Panzea data set, which

were deposited by Romay et al. (2013).

The network diagram generated from 1,191 lines resolved into three primary nodes which

correspond to the three main heterotic pools: Stiff Stalk, Lancaster/Non-Stiff Stalk and Iodent

(Figure 3.1). Within the Stiff Stalk and Lancaster main nodes there was further grouping of lines

into founder backgrounds. In the Lancaster node these groups corresponded to Oh43, Mo17 and

Wf9. In the Stiff Stalk node, while there was less definition between the groups, the groups

corresponded to B14, B73 and B37. Besides the three primary nodes, there were several other

minor groups of inbred lines. These smaller groups appear to be based on shared geographical

origin and do not fit into the three main nodes. The first is a grouping of Nebraska lines to the

left of the Iodent node (Figure 3.1). The second is a grouping of predominantly North Carolina

lines to the right of the Iodent cluster. There are also many small clusters between the Lancaster

and Stiff Stalk groups, which tend toward Lancaster. The population structure of these lines is

listed as “unclassified” in the file of ex-PVP lines deposited by Romay et al. (2013). The

coloured grouping to the right of the Iodent node is composed of Guelph high-carotenoid (HiC)

inbred lines (Figure 3.1). The HiC inbred lines were derived from breeding crosses involving

populations belonging to the Argentine Orange flints (Burt et al. 2011). The network diagram was

first applied to GBS data by Romay et al. (2013), using a set of 212 ex-PVP inbred lines, and

likewise generated three main nodes corresponding to the Stiff Stalk, non-Stiff Stalk and Iodent

heterotic patterns. As in Romay et al. (2013), lines with shared pedigrees show close

proximity in the diagram. For the Guelph inbred lines, the assignment to a primary node

corresponds to the heterotic patterns of the inbred lines.

Hierarchical clustering was performed using the full set of 2,696 inbred lines, including

all of the lines present in the network diagram, as well as popcorn, sweet corn, and tropical

germplasm. The dendrogram showed separation of tropical germplasm (Nigeria, Mexico,

Thailand and North Carolina) from the rest of the lines, and also showed separation of sweet

corn and popcorn from the U.S. field corn (Figure 3.2, Appendix Figure 1). The separation of

these groups into distinct clusters is consistent with what was observed by Hansey et al. (2011,

2012). Within U.S. field corn germplasm, there were further divisions between Stiff Stalk

germplasm (including the founder lines B73, B37 and B14) and non-Stiff Stalk germplasm, but

no clear division of Lancaster and Iodent within the non-Stiff Stalk cluster. The lack of clear

distinction between non-Stiff Stalk subgroups was also observed by Hansey et al. (2011, 2012).

Figure 3.1. Network diagram for 1,081 public and ex-PVP lines and 110 Guelph lines generated

using IBS values greater than 0.3 (A). Points sharing the same colour (red, green, pink or dark

blue) indicate Guelph inbred lines derived from the same pedigree. Other Guelph lines are

coloured light blue. Founder inbred lines of heterotic patterns are labelled. The germplasm

clusters into three main heterotic pools: Stiff Stalk (B73, B37, B14), Iodent (PH207) and

Lancaster (Mo17, Oh43, Wf9). Network diagram, with labels, zoomed in to the region

containing B73 (circled) showing the level of detail generated in the diagram (B).

Figure 3.2. Visual representation of the main division of branches in the dendrogram. The

branches containing tropical, popcorn, and sweet corn are labelled. The remaining branches of

the dendrogram contain U.S. field corn and Guelph lines, with the locations of key founder

individuals indicated.

Hierarchical clustering using Ward’s method was largely influenced by family structures

in the dataset, which complicates the interpretation of results and reduces the usefulness of this

approach. Using known family structures in the dataset, it was observed that sibling groups

derived from inter-heterotic crosses had an effect on the clustering of the parent lines. A group of

Guelph siblings were derived from PHJ40 x PHR25, an inter-heterotic cross of Stiff Stalk x

Iodent. When the dataset did not include any Guelph lines, the branches containing PHJ40 and

PHR25 diverge at the third split in the dendrogram and do not cluster close together, consistent

with their belonging to different heterotic groups (Fig. 3.3A, B). When all the Guelph lines were

included in the dataset, PHJ40 and PHR25 then cluster together, with their Guelph progeny (Fig.

3.3C). This example demonstrates the effect of shared offspring from an inter-heterotic cross on

the clustering output. The influence of a large number of progeny from an inter-heterotic cross

was also observed in the network diagram. The Stiff Stalk parent, PHJ40, was brought to the

periphery of the Iodent node, rather than being in the Stiff Stalk node, when progeny were

included in the analysis. This phenomenon should be considered when selecting a dataset for the

analysis of heterotic patterns.

Figure 3.3. Subset of the dendrogram showing closest neighbours of PHJ40 (A) and PHR25 (B),

located on branches that diverge in the third split of the dendrogram, when University of Guelph

lines are not included in the dataset. When progeny from PHJ40 x PHR25 are included in the

dataset, these lines cluster together, with their progeny (C).

The network diagram was more effective at assigning lines to heterotic patterns than

hierarchical clustering with Ward’s method. While both methods separated Stiff Stalk from non-

Stiff Stalk lines, only the network diagram generated distinct Lancaster and Iodent nodes within

the non-Stiff Stalk group. Another advantage of the network diagram was the stability of the

main nodes that were generated, corresponding to the Stiff Stalk, Iodent and Lancaster heterotic

groups. In contrast, repeated runs of the network diagram revealed that germplasm with

weak connections to the main nodes had low stability and was placed at arbitrary locations in

the diagram.

The merit of the network diagram is that it allows for multiple connections between

inbred lines, which gives a spatial sense of the strength of the connections between inbred lines.

With hierarchical clustering, the strength of the connection between inbred lines is not apparent

because every line is placed in the dendrogram, regardless of the extent of genetic similarity. In

the network diagram, lines with weak connections to any of the established groups can be easily

identified, as they are placed between the main nodes that form, or are placed in the periphery of

the diagram. The lines that fall between the main nodes are likely generated from a mix of

heterotic patterns and thus do not share strong similarity with either heterotic pattern. Within the

main nodes, the network diagram allows for the generation of sub-nodes, where the lines in the

center of the node have the strongest connection to other members of that node.

The previously described network diagram by Romay et al. (2013) has nicely defined

nodes with few points falling outside of the main nodes. In this study, the larger sample size of

nearly 1,200 lines, compared to 212 lines used by Romay et al. (2013), likely accounts for the

many small nodes falling outside of the main heterotic groups. The formation of clean nodes is

dependent on the germplasm used and the selection of lines with strong connections to other

inbred lines in the dataset. This research found that the network diagram is more informative

than hierarchical clustering with Ward’s method for identifying heterotic patterns in U.S.

maize.

3.4.2 Identification of putative parents of lines of unknown parentage using GBS data

One of the weaknesses of the hierarchical clustering approach, however, is also an attribute

that can be exploited to discover parentage of lines. Putative parents were identified using

hierarchical clustering of the dataset and identifying lines that clustered near the line of interest.

To test the usefulness of this clustering approach, the clustering output for a Guelph line with a

known pedigree, CG118, was investigated. The pedigree for CG118 is: CG102 x (CG33 x

CG65). The initial clustering of CG118 placed it with CG120 (sibling), CG119 (sibling), CG102

(parent) and CG114 (half-sibling of CG118). After removing SNPs that were shared between

CG118 and CG102, CG118 clustered with CG33 (second parent), CG34 (sibling of CG33),

CG106 (descendant of CG33) and CG24 (sibling of CG33). Removing SNPs that were shared

between CG118 and CG33, resulted in CG118 clustering with CG65 (third parent) and CG57

(sibling of CG65). This example demonstrates that if the parents are in the dataset, the line will

cluster with its parents as well as close relatives, and that removing SNPs matching the first

parent will lead to clustering with the second (and third) parent.
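The SNP removal step between successive clusterings can be illustrated with a short sketch. The function names are hypothetical, genotypes are assumed to be tuples with None for missing calls, and loci missing in either the line or the parent are assumed to be retained because a match cannot be confirmed there; the text does not state how such loci were handled.

```python
def loci_not_matching_parent(line, parent):
    """Indices of loci where the line's genotype does not match the parent's."""
    return [i for i, (a, b) in enumerate(zip(line, parent))
            if a is None or b is None or a != b]


def subset_loci(lines, keep):
    """Restrict every line in the dataset to the retained loci.

    lines: dict mapping line name to its list of genotypes.
    """
    return {name: [genos[i] for i in keep] for name, genos in lines.items()}
```

The reduced dataset returned by `subset_loci` would then be used to recalculate the similarity matrix and repeat the clustering, so that the line of interest groups with its second (or third) parent.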

Using the above example, what is the best strategy for identifying the parental line from a

group of lines within a cluster? The hypothesis that the progeny will share the greatest

percentage of SNP alleles with the parent, over other relatives, was tested. The percentage of

SNP alleles shared between a line with known pedigree information, CG118, and clustering

neighbours was determined. For the initial clustering, the percentage matching between CG118

and neighbours was: 73.16% with CG102 (a parent), 67.18% with CG114 (a half-sibling), 67.05%

with CG120 (a sibling), and 57.35% with CG119 (another sibling). After removing the CG102 alleles

and repeating the clustering, CG118 shared the greatest percentage of SNPs with

CG33 (a parent), compared to CG34, CG106 and CG24. The lines CG119 and CG120, also

derived from CG102 x (CG33 x CG65), were then assessed for percentage matching with these

same neighbours and both had the greatest percentage matching with CG102. For the second

clustering, CG120 shared the greatest percentage of SNP alleles with CG33, and CG119 shared

the most SNP alleles with CG106, a descendant of CG33. From this example, using known

family structures, it is clear that the progeny does not always share the greatest percentage of

alleles with the parent, over another relative. Assessing the percentage of shared alleles alone is

insufficient to distinguish parents from close relatives. In general, an inbred shares ~50% of

alleles with a parent, but also shares ~50% of alleles with full siblings. In addition, the genetic

relationships between maize lines are complicated because there is extensive inbreeding and

selection, which generates IBS values between relatives that will differ from what would be

found in livestock pedigrees, for example. For datasets with complex pedigree structures and

many relatives present, such as the example described, clustering appears to group the line with

its closest relatives but pedigree information is needed to sort out the nature of the relationships

between the lines.
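The percentage of shared SNP alleles used in this comparison can be computed as in the following sketch. It assumes the same hypothetical genotype representation (SNP identifier mapped to an allele call, None for missing) and scores only SNPs called in both lines; the function name and toy data are illustrative.

```python
def percent_shared(line_a, line_b):
    """Percentage of identical calls over SNPs called in both lines."""
    common = [
        snp for snp in line_a
        if line_a[snp] is not None and line_b.get(snp) is not None
    ]
    if not common:
        raise ValueError("no SNPs called in both lines")
    matches = sum(line_a[snp] == line_b[snp] for snp in common)
    return 100.0 * matches / len(common)


# Toy genotypes (hypothetical data).
progeny = {"s1": "A", "s2": "C", "s3": "G", "s4": "T"}
relatives = {
    "parent": {"s1": "A", "s2": "C", "s3": "G", "s4": "A"},
    "sibling": {"s1": "A", "s2": "C", "s3": "T", "s4": "A"},
}

# Rank the clustering neighbours by percentage of shared alleles.
ranked = sorted(
    relatives,
    key=lambda name: percent_shared(progeny, relatives[name]),
    reverse=True,
)
print(ranked)
```

As the CG119 example shows, the top-ranked neighbour in such a list is not guaranteed to be a parent, so the ranking is a screening aid rather than a parentage test.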

This approach of using GBS data to identify putative parents via hierarchical clustering,

removing alleles shared with the parent and rerunning the analysis is subject to several sources of

error that could lead to alleles unaccounted for by the parents. The hypothesis that 100% of the

alleles in an inbred will be accounted for by the correct parents was tested by using 30 inbred

lines with known pedigrees (Table 3.2). The percentage of SNP alleles not accounted for by the

proposed parents ranged from 0.20% to 6.16%. These percentages may be higher than the

expected 0% due to the error rate of SNP calling in GBS data, which is reported at an average of

0.18% (Romay et al. 2013). Another possibility is that the seed source used for the GBS

sequencing of the publicly available data may be different from the seed source used in the

Guelph breeding program and there may be small genetic differences between these lines.

Finally, pedigree records may be incorrect, or mutation could generate novel SNPs. While the

percentage of alleles not accounted for by the correct parents can be as high as 6%, a low

percentage can also be obtained from an incorrect pedigree. For example, the correct lineage of

CG118, which is CG102, CG33 and CG65, has a percentage of unaccounted-for alleles of 0.30%.

Substituting a full sibling, CG34, for CG33 gives a percentage of unaccounted-for alleles of 0.79%.

This method is unable to confirm a correct pedigree, but can identify putative pedigrees that are a

poor fit, with a high percentage of alleles unaccounted for by the parents. To give context for the

genetic similarity of lines and putative parents, 10,000 pedigrees were generated by randomly

selecting lines and parents from the dataset. The proportion of alleles unaccounted for

between randomly selected lines and parents ranged from 0% to 37.50%, with a median of

14.89% (Figure 3.4). Of these 10,000 randomly generated pedigrees, 99% had more than

6.41% of alleles unaccounted for by the parents, so obtaining a percentage

of 6.41% or less is unlikely to occur from a randomly generated pedigree.
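The goodness-of-fit measure and the random-pedigree comparison can be sketched as follows. This is an illustration only: the missing-data rule (score a SNP only when the progeny and every proposed parent are called), the function names, and the simulated toy lines are assumptions, and each random pedigree here has two parents.

```python
import random


def percent_unaccounted(progeny, parents):
    """Percentage of progeny alleles matched by none of the proposed parents."""
    scored = unaccounted = 0
    for snp, allele in progeny.items():
        calls = [parent.get(snp) for parent in parents]
        if allele is None or any(call is None for call in calls):
            continue  # assumption: skip SNPs with any missing call
        scored += 1
        if allele not in calls:
            unaccounted += 1
    if scored == 0:
        raise ValueError("no SNPs could be scored")
    return 100.0 * unaccounted / scored


def random_pedigree_null(lines, n_draws, rng):
    """Sorted percentages for randomly drawn (progeny, parent 1, parent 2) trios."""
    names = list(lines)
    values = []
    for _ in range(n_draws):
        child, p1, p2 = rng.sample(names, 3)
        values.append(percent_unaccounted(lines[child], [lines[p1], lines[p2]]))
    return sorted(values)


# Worked toy example: one of four alleles (s4) matches neither parent.
child = {"s1": "A", "s2": "C", "s3": "G", "s4": "T"}
parent_1 = {"s1": "A", "s2": "T", "s3": "G", "s4": "A"}
parent_2 = {"s1": "G", "s2": "C", "s3": "G", "s4": "A"}
example = percent_unaccounted(child, [parent_1, parent_2])

# Null distribution from simulated lines (hypothetical data, fixed seed).
rng = random.Random(1)
toy_lines = {
    f"L{i}": {f"s{k}": rng.choice("AC") for k in range(50)} for i in range(10)
}
null = random_pedigree_null(toy_lines, 200, rng)
lower_1pct = null[int(0.01 * len(null))]  # value near the lower 1% tail
print(example, lower_1pct)
```

A proposed pedigree whose percentage falls below the lower-tail cutoff of the null distribution is a better fit than almost all random pedigrees, although, as noted above, a close relative substituted for a true parent can also pass this screen.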

Table 3.2. The percentage of SNPs unaccounted for by either parent for inbred lines of known

parentage.

Inbred      Parent 1  Parent 2  Parent 3  Total SNPs  Number of alleles  Percent
                                                      unaccounted for
CG108       PHJ40     PHR25     -         118,485       234              0.20%
CG80        PHG29     PHG47     -         227,790       531              0.23%
PHKE6       PHG29     PHG47     -         138,473       345              0.25%
PHP85       PHK29     PHW52     -         247,111       638              0.26%
PHEG9       PHG86     PHW52     -         127,410       358              0.28%
CG118       CG102     CG33      CG65      173,142       512              0.30%
PHPR5       PHK76     PHW52     -         226,937       754              0.33%
CG119       CG102     CG33      CG65      116,816       393              0.34%
(NK)S8326   W117      Mo17      Mo17      221,812       858              0.39%
PHJ89       PHG47     PHT77     -         226,836       983              0.43%
PHW86       PHG71     PHG72     -         236,194     1,024              0.43%
(NK)W8555   B73       B84       -         247,588     1,252              0.51%
PHK56       PHG47     PHG35     -          90,879       463              0.51%
CG120       CG102     CG33      CG65      171,999       914              0.53%
LH74        A632      B73       -         251,461     1,340              0.53%
PHW80       PHK76     PHN37     -         133,451       743              0.56%
PHBA6       PHG47     PHZ51     -         226,971     1,350              0.59%
(NK)W8304   B14       B73       B73       256,948     1,554              0.60%
CG123       CGR01     CG110     -          71,101       460              0.65%
(NK)778     W117      B37       -         218,080     1,418              0.65%
(NK)807     W117      B37       -         215,138     1,443              0.67%
PHRE1       PHJ40     PHR47     -         140,547     1,081              0.77%
LH216       LH51      LH123     LH51      114,893     1,020              0.89%
CG122       CG60      CG102     -          75,149       890              1.18%
CG26        F2        F7        -         169,699     2,206              1.30%
LH214       LH123     LH51      -         236,921     5,066              2.14%
LH213       LH123     LH51      -         235,589     5,418              2.30%
CG121       CG60      CG102     -          77,335     2,122              2.74%
PHJ90       PHG50     PHK42     -         199,376     9,163              4.60%
PHW53       PHG50     PHZ51     -         142,058     8,747              6.16%

Figure 3.4. Histogram of the proportion of SNP alleles unaccounted for by either parent when

inbred lines and parents are randomly sampled from the dataset (n=10,000).

After using examples of lines with known pedigrees, this clustering method was applied to

lines from the Guelph breeding program which were created by self-pollinating commercial

single-cross hybrids. This was based on the assumption that one or both of the parents of the

single-cross hybrids were present in the dataset of ex-PVP inbred lines. For groups of full

siblings in the dataset, the group of siblings was clustered with the public and ex-PVP lines, and

then each inbred was clustered individually with the public and ex-PVP lines. Comparing the

clustering results of groups of siblings, there were several types of outcomes observed, with the

percentage of alleles unaccounted for being used to assess the genetic closeness of the proposed

pedigree.

The first outcome was that each sibling clustered with the same ex-PVP inbred line in the

first and second clustering. For example, a group of five siblings first clustered with PHJ40, then

PHG72 in the second clustering (Figure 3.5). The percentage of alleles unaccounted for by either

of these proposed parents ranged from 0.54% to 0.76% for the five siblings, making it highly

probable that PHJ40 and PHG72 were the correct parents for this group of offspring.

Figure 3.5. Clustering of five siblings places all five lines with PHJ40 as the putative parent (A).

Removing SNPs with identical alleles between the lines and PHJ40 places each line with

PHG72, showing CG64 as an example (B).

The second type of outcome was that some siblings in the group clustered with the same

proposed parents, but one sibling clustered with different lines. In a group of five siblings, four

siblings each clustered with LH85, then PHJ40 (Figure 3.6). The percentage of alleles

unaccounted for by either of the proposed parents ranged from 3.00% to 3.96% for these four

lines. The fifth sibling, CG42, did not cluster with the rest of the siblings (Figure 3.7), suggesting

that perhaps the pedigree records of CG42 are incorrect and it is not a sibling. In another

example, seven of the eight siblings clustered with PHJ40, then PHR25, with the percentage of

unaccounted for alleles ranging from 0.43% to 0.64%. The eighth sibling, CG60, had a much

higher percent of unaccounted for alleles, 6.34% for PHJ40 x PHR25. Initial clustering placed

CG60 with PHJ40 but the second clustering did not place CG60 with any ex-PVP lines. In the

case of CG60, it is most likely a progeny from PHJ40 x PHR25, as previous SSR marker work

indicated a high degree of IBD with the other siblings (Lee et al. 2006, 2007).

Figure 3.6. Clustering of five siblings places four of the siblings together with LH85 as the

putative parent (A). For these four siblings, removing SNPs with identical alleles to LH85 places

each line with PHJ40, showing CG70 as an example (B). The fifth sibling, CG42, clusters

elsewhere (Figure 3.7).

Figure 3.7. Initial clustering places CG42 with Tx714 and N201 (A). Removing SNPs with

identical alleles to N201 (B) or Tx714 (C) places CG42 with NC278 as the second putative

parent.

Another outcome is that lines cluster with more than one close relative. Initial clustering

for CG42 placed it with N201 and Tx714 (Figure 3.7). Tx714 and N201 are genetically similar

and cluster together when CG42 is not in the dataset. Selecting either line as the first parent

results in clustering with NC278 as the second parent. The percentage of alleles unaccounted for

by proposed parents is 0.69% for Tx714 x NC278, and 0.76% for N201 x NC278. It appears that

both proposed pedigrees are a good fit, and the correct lineage could not be determined without

pedigree records.

The final outcome is that there was no consensus within a group of siblings as to which

inbred lines were the putative parents. This was observed for a group of five siblings (CG37,

CG38, CG57, CG65, and CG79) when lines were clustered individually (Figure 3.8). Two of the

siblings clustered with SD102 with differing surrounding ex-PVP lines. The other three siblings

had identical clustering results, all clustering with another group of ex-PVP lines. It is speculated

that the actual parents of these lines are perhaps not in the dataset. It may be useful to cluster

lines with a different set of germplasm in hopes of obtaining consistent clustering with one ex-

PVP line for all five siblings. While GBS data can be used to verify pedigree records and, in

some cases, identify putative parents of lines derived from unknown inbred lines, the major

limitation with this technique is that parents cannot be discerned from close relatives. However,

knowing the exact parentage may not be necessary, as identifying germplasm that is similar to

the line of interest is still beneficial to a plant breeder.

Figure 3.8. Clustering of five siblings did not reveal a clear putative parent. When each sibling

was clustered individually, two of the siblings, CG57 (A) and CG65 (B), clustered with SD102,

with differing surrounding ex-PVP lines. The other three siblings, CG37 (C), CG38 (D) and

CG79 (E), clustered with the same set of ex-PVP lines, which differs from the clustering in (A)

and (B).

3.4.3. Determining genome contribution of the parents from a breeding cross using GBS data

The proportion of alleles derived from each parent was determined for the two sets of

Guelph siblings: set #1 from a Stiff Stalk x Iodent breeding cross and set #2 from a three-way

cross involving two Stiff Stalk parents and one Lancaster parent. The seven siblings from set #1

were previously classified as Iodent inbred lines, therefore an overrepresentation of Iodent alleles

was expected. The three siblings from set #2 were classified as Stiff Stalk inbred lines, and the

expectation for this group was an overrepresentation of Stiff Stalk alleles. These two

expectations were not consistently met. For example, the proportion of SNP alleles inherited

from the Iodent parent in set #1 ranged from 38.76% to 68.87% (Figure 3.9). For each inbred

line in the three-way cross dataset, the number of SNP alleles derived from each parent was

significantly different from expectations (2:1:1) (p < 0.001 for all tests of line vs. expectation),

although all lines share the largest proportion of alleles with CG102, as expected (Figures 3.10 and 3.11).

These three siblings behave as Stiff Stalks, so a minimization of alleles from the Lancaster

parent, CG33, was expected. Instead, two of the inbreds had greater than the expected 25%

contribution of CG33. Surprisingly, these results suggest that in an inter-heterotic cross,

minimizing alleles from one heterotic pattern is not required for the line to behave as the other

heterotic pattern, based on the markers used in this study.
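The comparison against the 2:1:1 expectation can be illustrated with a goodness-of-fit test. The text does not name the test that was used, so the chi-square test below is an assumption, and the allele counts are back-calculated from the rounded percentages and marker total reported for CG118 in Figure 3.11, so they are approximate. The sketch also treats SNPs as independent, which linked markers are not.

```python
import math


def chi_square_gof(observed, expected_ratio):
    """Chi-square goodness-of-fit statistic for counts against a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df2(stat):
    """Upper-tail p-value for 2 degrees of freedom (three categories).

    For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    """
    return math.exp(-stat / 2.0)


# Approximate counts for CG118 (CG102, CG33, CG65), summing to 26,467 markers.
observed_cg118 = [11645, 10322, 4500]
stat_cg118 = chi_square_gof(observed_cg118, (2, 1, 1))
p_cg118 = p_value_df2(stat_cg118)
print(round(stat_cg118, 1), p_cg118 < 0.001)
```

With tens of thousands of markers, even modest departures from 2:1:1 give very small p-values, so the size of the departure (for example, 39% versus the expected 25% from CG33) is the more informative quantity.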

Figure 3.9. Parental linkage block distribution across the seven inbreds derived from the two-way breeding cross, PHJ40 (Stiff Stalk, blue) and PHR25 (Iodent, red). Of the SNP alleles that could be assigned to a parent, the percentage of SNP alleles inherited from the Iodent parent is shown: CG62 (68.9%), CG108 (54.7%), CG104 (53.9%), CG40 (51.9%), CG41 (50.0%), CG105 (47.3%), CG61 (38.8%).

Figure 3.10. Parental linkage block distribution across the three inbreds derived from the three-way breeding cross (panels: CG118, CG119, CG120). Linkage blocks in blue were inherited from CG102, red from CG33, and green from CG65.

Figure 3.11. The expected and observed percentage of parental contribution in the 3-way cross

derived inbred lines based on SNP alleles that could be assigned to a parent. Parental lines:

CG102 (blue), CG65 (green) and CG33 (red). The number of markers used for each line in this

analysis are as follows: 26,467 for CG118, 16,987 for CG119, and 26,668 for CG120.

Line          CG102   CG33   CG65
CG118         44%     39%    17%
CG119         63%     31%     6%
CG120         49%     22%    29%
Expectation   50%     25%    25%

While all the siblings in set #1 were derived from the same commercial hybrid, the inbred

CG108 was derived through modified full-sib mating, in contrast to the other siblings, which were

derived through the traditional self-pollination approach (Lee and Kannenberg 2004). The loss of

heterozygosity, in theory, is more gradual in the full-sib mating method compared to the rapid

inbreeding of the self-pollination method. Recombination events are only detectable in the inbred

lines if the recombination occurred in a region with heterozygosity. Because heterozygosity

persists longer in the full-sib mating method, it was expected that CG108 would have more

detectable recombination events, resulting in a larger number of small parental linkage blocks

than the other siblings. In contrast to this expectation, the number of linkage blocks in CG108

was the lowest among all the siblings (Table 3.3), and CG108 had proportionally fewer small

linkage blocks than the other lines (Figure 3.12). The linkage blocks in CG108 were fewer and

larger than the blocks of the other siblings, suggesting that CG108 had much lower rates of

recombination in heterozygous regions than the other siblings.
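Counting parental linkage blocks amounts to collapsing runs of consecutive SNPs that are assigned to the same parent. The sketch below assumes, for illustration, that each SNP on a chromosome has already been assigned to a parent and that a block extends from its first to its last SNP; the function name and toy positions are hypothetical.

```python
from itertools import groupby


def parental_blocks(snps):
    """Collapse consecutive same-parent SNPs on one chromosome into blocks.

    `snps` is a list of (position_bp, parent) tuples sorted by position.
    """
    blocks = []
    for parent, run in groupby(snps, key=lambda snp: snp[1]):
        run = list(run)
        start, end = run[0][0], run[-1][0]
        blocks.append({
            "parent": parent,
            "start": start,
            "end": end,
            "n_snps": len(run),
            "length_bp": end - start + 1,  # assumption: first to last SNP
        })
    return blocks


# Toy chromosome (hypothetical positions and assignments).
chromosome = [
    (100, "PHJ40"), (250, "PHJ40"),
    (900, "PHR25"), (1200, "PHR25"), (1500, "PHR25"),
    (4000, "PHJ40"),
]
blocks = parental_blocks(chromosome)
print(len(blocks), [b["n_snps"] for b in blocks])
```

Each switch of parent along the chromosome marks a detectable recombination event, so fewer and larger blocks, as seen for CG108, correspond to fewer detected switches.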

Table 3.3. Number and size of parental linkage blocks identified in 10 University of Guelph

inbred lines derived from breeding crosses.

                     Size of parental linkage blocks
         Number of SNPs       Length (base pairs)
Inbred   max      average     max            average       Number of blocks
Set #1
CG40     2,625     70.88      122,320,764     1,960,463    1,004
CG41     2,041     49.03       98,933,285     1,351,890    1,432
CG61     5,395     54.52      190,407,262     1,528,351    1,287
CG62     1,417     71.36       86,974,542     2,049,483      951
CG104    2,012     51.04       88,093,024     1,397,499    1,383
CG105    3,036     37.35      138,508,477     1,022,729    1,869
CG108    1,472    191.42      121,714,668    10,133,250      196
Set #2
CG118    1,658     61.82      135,930,065     4,246,284      423
CG119      787     94.61      135,425,492    10,140,401      178
CG120    1,316     51.62      127,833,832     3,707,412      510

Figure 3.12. Histogram of sizes of parental linkage blocks, with the frequency of blocks in each

bin expressed as a percentage of the total number of linkage blocks for that inbred line.

Another potential application of GBS data is to identify parental genome segments that

are common across a group of selected progeny. In the Stiff Stalk x Iodent dataset, only 377

SNPs were polymorphic in the parents and shared between all seven siblings (Figure 3.13A).

From visual inspection of the ideogram, five parental genome segments were identified with

dense marker coverage. These regions contained 19 to 140 SNPs and are likely more meaningful,

in terms of selection during breeding, than regions marked by a single SNP (Table 3.4).

Unexpectedly, there were no large linkage blocks from the Iodent parent that were common to all

siblings. This suggests that lines derived from a two-way inter-heterotic cross do not require

large common linkage blocks from the Iodent parent for the inbred line to behave as an Iodent

heterotic pattern. In the three-way cross dataset, 4,289 SNPs were polymorphic for the parents

and shared between the three siblings (Figure 3.13B). The largest regions common across the

siblings were a region of CG102 on most of chromosome 10 and a region of CG33 on

chromosome 1. These shared regions may be due to random chance, rather than selection

pressure, but it is proposed that shared regions between siblings contain genes underlying

behavior as a Stiff Stalk and for desirable agronomic traits such as earliness. These examples

demonstrate that GBS data has applications for determining parental contribution to offspring,

parental genome segments, and shared regions between siblings in a maize breeding program.
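Finding SNPs that every sibling inherited from the same parent is a set intersection over the per-sibling parent assignments. The following sketch assumes each sibling is represented as a dictionary of SNP identifier to parent of origin, containing only SNPs that were polymorphic between the parents and assignable; names and data are hypothetical.

```python
def shared_parental_snps(siblings):
    """SNPs assigned to the same parent in every sibling.

    `siblings` maps a sibling name to {snp_id: parent_of_origin}.
    """
    assignments = list(siblings.values())
    if not assignments:
        return {}
    first, rest = assignments[0], assignments[1:]
    return {
        snp: parent
        for snp, parent in first.items()
        if all(other.get(snp) == parent for other in rest)
    }


# Toy assignments (hypothetical data).
siblings = {
    "CG40": {"s1": "PHJ40", "s2": "PHR25", "s3": "PHR25"},
    "CG41": {"s1": "PHJ40", "s2": "PHJ40", "s3": "PHR25"},
    "CG61": {"s1": "PHJ40", "s2": "PHR25"},
}
shared = shared_parental_snps(siblings)
print(shared)
```

A SNP that is unassigned in any one sibling is excluded, which is one reason the shared set shrinks quickly as more siblings are added.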

Figure 3.13. Ideograms of SNP alleles shared by all seven siblings from the Stiff Stalk x Iodent

cross (A) and all three siblings in the 3-way cross (B). Regions in blue represent PHJ40 and red

represent PHR25 (A); regions in blue represent CG102, red CG33, and green CG65 (B). Regions

of dense SNP markers are indicated by green boxes (A).

Table 3.4. IBD regions between the seven siblings of the Stiff Stalk x Iodent cross. These regions of dense marker coverage were selected based on visual inspection of the ideogram (Figure 3.13).

Chromosome  Start position  End position  Length (bp)  Number of SNPs  Parent
1           280345879       295437051     15,091,172    43             PHJ40
2           232879401       236413006      3,533,605    41             PHR25
4           159297901       167074846      7,776,945   140             PHR25
9             4290930         8037689      3,746,759    19             PHJ40
9            23540249        26831698      3,291,449    28             PHR25

3.5 CONCLUSIONS

This paper explores alternative applications of GBS data to maize inbred lines. Lines can

be assigned to heterotic patterns more effectively using a network diagram approach than the

traditional hierarchical clustering. Hierarchical clustering is effective for identifying putative

parents for lines with unknown parentage or to confirm pedigree records. The percentage of SNP

alleles unaccounted for by either putative parent can be used to gauge whether the proposed

pedigree is a good fit. Finally, GBS data can be used to determine the extent of parental

contribution as well as identify parental genome segments in progeny derived from breeding

crosses.

CHAPTER 4: IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES

4.1 ABSTRACT

In silico mapping integrates QTL mapping with plant breeding by detecting QTLs using

existing phenotypic data from breeding program trials. North Carolina Design II (NCII) is a

mating design commonly used in maize breeding to assess general and specific combining

ability. The use of in silico mapping in maize was explored using yield data from an NCII

breeding scheme with Stiff Stalk and Iodent commercial caliber material. The 110 hybrids were

evaluated for grain yield at three plant densities in three locations over three years. A mixed

linear model was used to test GBS marker alleles for associations with additive effects for grain

yield in each heterotic pattern. The matrix to describe the relationships between lines was

derived using two different methods: pedigree records and genotyping-by-sequencing data. The

methods produced very similar results, suggesting that marker data can substitute for pedigree

records in generating the relationship matrix. This research identified 123 significant SNP

associations for additive effects for grain yield in Stiff Stalk inbreds located on five

chromosomes. Six of the 12 bins containing QTLs have been reported to contain grain yield

QTLs by previous studies. The SNPs together explain approximately 9.38% of the phenotypic

variance. No significant SNP associations were found for Iodent inbreds, demonstrating the

uniqueness of QTLs to specific heterotic backgrounds. QTLs detected from this approach can be

used for marker assisted selection or genome-wide selection in the breeding program. This paper

demonstrates an in silico mapping mixed model approach to integrate QTL mapping with NCII

using GBS data.

4.2 INTRODUCTION

Yield improvement efforts now incorporate genotypic data as an essential component of

plant breeding programs (Bernardo and Yu 2007). With the decreasing costs of marker data,

genotypic data now plays a large role in breeding through marker-based selection and breeding

value prediction using genome-wide selection. Identification of genomic regions influencing

traits of interest utilizes either a linkage mapping approach or an association mapping approach.

Traditional QTL mapping for grain yield has been performed using linkage mapping, which

utilizes a bi-parental cross to generate a mapping population. The recombination events that

occur from the bi-parental cross and subsequent inbreeding of the F2 lines create linkage

disequilibrium between the markers and QTLs in this mapping population, facilitating QTL

detection. Previous studies have identified maize grain yield QTLs using linkage mapping with

populations composed of F2 derived lines (Austin and Lee 1996; Ribaut et al. 1997; Malosetti et

al. 2008) and populations derived from crossing the F2:F3 lines, RILs or doubled haploid lines

with a tester (Melchinger et al. 1998; Ho et al. 2002; Boer et al. 2007).

An alternate approach to linkage mapping is association mapping, also referred to as

linkage disequilibrium mapping or genome wide association study (GWAS). This approach

utilizes historic linkage disequilibrium, arising from historical recombination events, in a

population of related individuals (Zu et al. 2008). This linkage disequilibrium is present in the

entire population analyzed, rather than only being present in an experimentally generated

population. The use of existing phenotypic and genetic data to detect QTLs through

computational methods is referred to as in silico mapping (Grupe et al. 2001).

In silico association mapping can be performed using elite germplasm, which has a

number of advantages over traditional linkage mapping with a bi-parental cross: (1) As the set of

materials that the QTLs are detected in is elite germplasm, the information can be used directly

for marker assisted selection in a breeding program (Parisseaux and Bernardo 2004; Crepieux et

al. 2005); (2) Phenotypes can be generated with environmental replicates, reducing

environmental effects (Zhang et al. 2005); (3) Using existing phenotypic data has reduced cost

compared to generating and assessing phenotypes for a large mapping population (Parisseaux

and Bernardo 2004); and (4) A dataset of elite germplasm has greater potential for QTL

discovery because the parents of the bi-parental cross are likely to be monomorphic for some

markers and will have limited allelic diversity compared to the population as a whole (Parisseaux

and Bernardo 2004; van Eeuwijk et al. 2010; Bink et al. 2012).

This in silico mapping approach can either be applied to an association mapping panel or

to hybrid data from a maize breeding program. An association mapping panel is a collection of

elite inbred maize lines that are genotyped and phenotyped to use in GWAS. An association

mapping panel of 2,279 maize inbred lines from the USDA germplasm collection was used in

conjunction with GBS data for GWAS of growing degree days to 50% silking (Romay et al.

2013). Phenotypic data from breeding programs can also be used as an association mapping

panel. Zhang et al. (2005) mapped QTL for growing degree day heat units to pollen shedding

using 189 microsatellite markers and phenotypic data for 282 maize inbred lines belonging to

Pioneer Hi-bred International (Johnston, IA).

An alternate application of in silico mapping uses phenotypic data for maize hybrids,

generated in a breeding program, and genotypic data for parental inbred lines. This approach

detects QTLs in the parental inbred lines, with QTLs detected in each heterotic pattern

separately. This approach, as compared to an association mapping panel, has several advantages,

including: (1) The use of hybrid data can capture the heterotic phenotype of traits such as grain

yield and plant height, which cannot be achieved using phenotypic data of parental lines; (2) It

allows for detection of QTLs unique to a heterotic pattern, as well as cross-validation of QTLs

between heterotic patterns (Parisseaux and Bernardo 2004); and (3) Hybrid trials are conducted

in environments to which they are adapted, ensuring that detected QTLs are relevant to

environments used for crop production.

In silico association mapping using maize hybrid data from breeding programs has been

previously described. Parisseaux and Bernardo (2004) detected QTLs for grain moisture, plant

height and smut resistance using 96 SSR markers and 22,774 hybrids from the Limagrain

genetics program (France). The hybrids were generated from 1,266 inbred lines using nine

combinations of the nine heterotic groups. van Eeuwijk et al. (2010) detected QTLs for ear

height, plant height and yield using 769 SNPs and hybrid phenotypic data from Pioneer Hi-bred

International. The germplasm included 1,700 hybrids generated by crossing lines from two

heterotic groups. Both of these studies generated hybrids by crossing individuals from different

heterotic pools in various combinations, rather than systematically generating hybrids. The use of

a systematic mating design for in silico QTL discovery has not yet been explored.

North Carolina Design II (NCII) (Comstock and Robinson 1952) is a commonly used

mating design that is ideal for the integration of QTL mapping with existing maize hybrid data.

This mating design can be used to assess general and specific combining ability, and identify

superior inbreds and parental combinations. The NCII is a systematic mating design originally

developed for livestock, but it has been routinely used in maize as it nicely accommodates

heterotic patterns. In the NCII design, all female lines, belonging to one heterotic pattern, are

crossed to all male lines, belonging to a second heterotic pattern. The NCII structure partitions

the genotypic variance into effects due to the female, effects due to the male, and effects due to

the interaction of the male and female (Hallauer et al. 2010a). The genetic effects due to males

and females are equivalent to additive genetic effects, while the genetic effects due to the

interaction are equivalent to non-additive genetic effects (Rojas and Sprague 1952). This is an

efficient method to assess whether there are sufficient additive genetic effects in each heterotic

group to warrant a QTL mapping experiment with the data. The model described in this paper

aims to detect additive allele effects, and not dominance allele effects, because additive allele

effects generate predictable phenotypes, making them utilizable in a breeding program with

marker assisted selection.

This paper examines the potential of utilizing a structured mating design and GBS data

for in silico mapping. Inbred lines representing two heterotic patterns and possessing some

common ancestry were used to examine the potential of in silico QTL mapping. This study

utilized existing hybrid data for grain yield, at different plant densities, from an NCII consisting

of elite short season Stiff Stalk and Iodent inbred lines. While previous studies in silico mapping

studies have used pedigree information to establish the additive genetic relationships between

lines, this paper demonstrates the use of GBS data to estimate genomic relationships.

4.3 METHODS

4.3.1 Germplasm and Field Trials

Using a North Carolina Design II (Comstock and Robinson 1952), 110 hybrids were

created by crossing 11 Stiff Stalk inbred lines as females to 10 Iodent inbred lines (Table 4.1).

The inbred lines comprised ex-PVP inbred lines and inbred lines developed at the

University of Guelph. The 110 hybrids were grown at three locations (Alma,

Elora and Waterloo, ON) for three years (2009-2011) at three plant population densities (37,000,

74,000, and 148,000 plants ha⁻¹) using a split-plot design with density as the main plot and

hybrids as sub-plots. Trials consisted of two replications per location and experimental units

were 2-row plots with 5.79 m rows, 0.76 m between rows, and 0.91 m between ranges. Data was

recorded for machine-harvestable grain yield, grain moisture, and bulk density (test weight). See

Holtrop (2016) for more details on the yield trials.

Pedigree records for the Stiff Stalk and Iodent heterotic groups were used to construct a

pedigree-based relationship matrix for each group. Additive relationship coefficients were

calculated from the pedigree records using the tabular method (Emik and Terrill 1949;

Henderson 1976). The diagonal elements were set to the default of 1, assuming parents of the

inbred line are unrelated.
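The tabular method builds the additive relationship matrix one individual at a time, with each off-diagonal element equal to the average of the relationships to the individual's two parents. The sketch below is a generic version written for illustration; the `unit_diagonal` flag mirrors the statement that diagonal elements were set to 1, and the example pedigree treats PHJ40 and PHG72 as unrelated founders, which is a simplification of Table 4.1.

```python
def additive_relationship(pedigree, unit_diagonal=True):
    """Additive relationship matrix by the tabular method.

    `pedigree` is a list of (name, parent1, parent2) tuples with parents
    listed before their progeny; unknown parents are None.
    """
    names = [name for name, _, _ in pedigree]
    idx = {name: i for i, name in enumerate(names)}
    n = len(names)
    a = [[0.0] * n for _ in range(n)]
    for name, p1, p2 in pedigree:
        i = idx[name]
        parents = []
        for p in (p1, p2):
            if p is None:
                continue
            if p not in idx or idx[p] >= i:
                raise ValueError(f"parent {p!r} must be listed before {name!r}")
            parents.append(idx[p])
        # Off-diagonal: half the sum of relationships to the known parents.
        for j in range(i):
            a[i][j] = a[j][i] = 0.5 * sum(a[j][p] for p in parents)
        # Diagonal: fixed at 1, or 1 + F with F = 0.5 * a(parent1, parent2).
        if unit_diagonal or len(parents) < 2:
            a[i][i] = 1.0
        else:
            a[i][i] = 1.0 + 0.5 * a[parents[0]][parents[1]]
    return names, a


# Illustrative pedigree: two founders and two lines from the same cross.
pedigree = [
    ("PHJ40", None, None),
    ("PHG72", None, None),
    ("CG44", "PHJ40", "PHG72"),
    ("CG85", "PHJ40", "PHG72"),
]
names, a = additive_relationship(pedigree)
print(a[names.index("CG44")][names.index("CG85")])
```

In this toy pedigree each derived line has a coefficient of 0.5 with each founder and with its sibling, which is the value the tabular method assigns to non-inbred full siblings.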

Table 4.1. Pedigree of Stiff Stalk and Iodent inbred lines used in the NCII. Pedigree information

from E.A. Lee (Unpublished data) and United States plant variety protection certificates.

Inbred (female)  Heterotic group              Pedigree
CG37             Stiff Stalk                  Pioneer 3803
CG38             Stiff Stalk                  Pioneer 3803
CG57             Stiff Stalk                  Pioneer 3803
CG65             Stiff Stalk                  Pioneer 3803
CG79             Stiff Stalk                  Pioneer 3803
CG102            Stiff Stalk                  CG Stiff Stalk Combined C3
CG118            Stiff Stalk                  (CG65/CG33)/CG102
CG119            Stiff Stalk                  (CG65/CG33)/CG102
CG120            Stiff Stalk                  (CG65/CG33)/CG102
PHJ40            Stiff Stalk                  PHB09/PHB36
PHG71            Stiff Stalk/Iodent           A632Ht/PH207

Inbred (male)    Heterotic group              Pedigree
PHG42            Iodent                       PH207/(PH207/PH806)
PHG29            Iodent                       PH207/(PH207/PH806)
PHG50            Iodent/Unrelated             PH848/PH207
PHG72            Iodent                       PH814/PH207
PHG83            Iodent/Lancaster/Unrelated   PH814/PH207
PH207            Iodent                       PH3BD2/PHG3RZ1
CG44             Iodent                       PHJ40/PHG72
CG60             Iodent                       PHJ40/PHR25
CG85             Iodent                       PHJ40/PHG72
CG108            Iodent                       PHJ40/PHR25

4.3.2 Molecular marker data

Genotyping-by-sequencing (GBS) data for the 21 inbred lines used in NCII was

generated by the Genomic Diversity Institute at Cornell, using the method of Elshire et al. (2011)

and Romay et al. (2013). To create the marker data for in silico mapping, the SNPs were filtered

for minor allele frequency (maf) ≥ 0.05 and call rate of 100% (i.e., no missing data) using PLINK v1.9 (Purcell et al.

2007). SNPs were coded for the number of minor alleles (0,1,2) for the purpose of estimating

additive effects (Purcell et al. 2007). Ideograms, showing the distribution of the markers over the

chromosomes, were created using PhenoGram (Wolfe et al. 2013), with sizes of maize

chromosomes and centromere positions from MaizeGDB (Andorf et al. 2010).
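The filtering criteria can be expressed compactly in code. The sketch below is a plain-Python illustration of the two thresholds and is not the PLINK implementation used in this work; it assumes genotypes are stored per SNP as a list of allele counts (0, 1 or 2, with None for missing), and the function name is hypothetical.

```python
def filter_snps(genotypes, min_maf=0.05, min_call_rate=1.0):
    """Keep SNPs passing minor allele frequency and call rate thresholds.

    `genotypes` maps a SNP id to a list of allele counts (0, 1, 2 or None).
    """
    kept = {}
    for snp, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        if not called or len(called) / len(calls) < min_call_rate:
            continue
        freq = sum(called) / (2 * len(called))  # frequency of the counted allele
        maf = min(freq, 1.0 - freq)
        if maf >= min_maf:
            kept[snp] = calls
    return kept


# Toy data (hypothetical): four inbred lines, four SNPs.
genotypes = {
    "s1": [0, 2, 2, 0],      # polymorphic, fully called
    "s2": [0, 0, 0, 0],      # monomorphic
    "s3": [2, None, 0, 2],   # one missing call
    "s4": [2, 2, 2, 0],      # polymorphic, fully called
}
strict = filter_snps(genotypes)                      # call rate of 100%
relaxed = filter_snps(genotypes, min_call_rate=0.1)  # lower call rate threshold
print(sorted(strict), sorted(relaxed))
```

The strict setting corresponds to the marker set used for mapping, while a lower call rate threshold retains more SNPs at the cost of missing data.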

The GBS data was also used to construct the genomic relationship matrix (G) for each

heterotic pattern. For the inbred lines in each heterotic pattern, SNPs were filtered for minor

allele frequency (maf) ≥ 0.05 and minimum call rate of 10%, with 173,490 SNPs passing

filtering criteria for Iodent lines and 159,180 SNPs for Stiff Stalk lines. For each heterotic

pattern, the G matrix, reflecting the allelic similarity, or identity-by-state (IBS), between the

inbred lines, was created using the method of VanRaden (2008). To facilitate

inversion of the G matrices, a value of 10 was added to diagonal elements. The genomic

relationship coefficients, as presented in the results, were calculated from coefficients in the G

matrix, using the following formula for individuals i and j:

$$\text{Relationship of } i, j = \frac{G_{ij}}{\sqrt{G_{jj}} \cdot \sqrt{G_{ii}}}$$
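As a minimal sketch of these two steps, the Python code below builds a G matrix from 0/1/2 allele counts and then rescales it with the formula above. It assumes the commonly used first method of VanRaden (2008), in which genotypes are centred on twice the allele frequency and scaled by 2Σp(1−p); the thesis does not state which variant was used, so this choice is an assumption. The function names and the toy genotypes are hypothetical, and the constant added to the diagonal to aid inversion is omitted.

```python
from math import sqrt


def vanraden_g(genotypes):
    """Genomic relationship matrix from allele counts (0, 1, 2).

    genotypes: one row per line, one column per SNP. Allele frequencies are
    estimated from the supplied lines, so those lines act as the base population.
    """
    n_lines = len(genotypes)
    n_snps = len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2 * n_lines) for k in range(n_snps)]
    # Centre each genotype on its expected value 2p
    z = [[row[k] - 2 * freqs[k] for k in range(n_snps)] for row in genotypes]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [
        [sum(a * b for a, b in zip(z[i], z[j])) / scale for j in range(n_lines)]
        for i in range(n_lines)
    ]


def relationship(g, i, j):
    """Relationship of i, j = G_ij / (sqrt(G_jj) * sqrt(G_ii))."""
    return g[i][j] / (sqrt(g[j][j]) * sqrt(g[i][i]))


# Four hypothetical inbred lines genotyped at four SNPs
toy = [
    [0, 0, 2, 2],
    [0, 0, 2, 0],
    [2, 2, 0, 0],
    [2, 0, 0, 2],
]
g = vanraden_g(toy)
print(round(relationship(g, 0, 1), 3), round(relationship(g, 0, 2), 3))
```

Because every column of the centred matrix sums to zero, each row of G also sums to zero, so some off-diagonal coefficients are necessarily negative whenever others are positive.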


4.3.3 Mixed Model Analysis

Broad-sense heritability for grain yield, combined over years, planting densities and replicates, was calculated by fitting the following model in ASReml 3.0 (Gilmour et al. 2009):

$$y = \mathit{ind} + e \quad (1)$$

where $y$ is grain yield, $\mathit{ind}$ is the individual hybrids and $e$ is the residual error ($e \sim N(0, I\sigma_e^2)$). A pedigree file was used to connect the hybrids to their Stiff Stalk and Iodent parents and contained limited pedigree information for the parents. Broad-sense heritability was estimated using the individual and error variances: $h^2 = \sigma_{ind}^2 / (\sigma_{ind}^2 + \sigma_e^2)$.

QTLs were identified using a mixed model approach based on Parisseaux and Bernardo

(2004):

$$y = X_1\beta + X_2 n + X_2 d + (M_1\alpha_1 \text{ or } M_2\alpha_2) + Z_1 g_1 + Z_2 g_2 + e \quad (2)$$

where $y$ is the vector of phenotypic observations (grain yield), $\beta$ is a vector of fixed effects (including overall mean $\mu$, density and replicate), $n$ is the environment (coded for combination of year and location) ($n \sim N(0, I\sigma_n^2)$), $d$ is the density by environment variable (coded for combination of density, year and location) ($d \sim N(0, I\sigma_d^2)$), $\alpha_1$ and $\alpha_2$ are vectors of random additive genetic effects associated with markers in heterotic groups 1 and 2 ($\alpha_1 \sim N(0, I\sigma_{\alpha_1}^2)$, $\alpha_2 \sim N(0, I\sigma_{\alpha_2}^2)$), $g_1$ and $g_2$ represent random polygenic effects of heterotic groups 1 and 2 ($g_1 \sim N(0, G_1\sigma_{g_1}^2)$, $g_2 \sim N(0, G_2\sigma_{g_2}^2)$), and $e$ is the vector of residuals ($e \sim N(0, I\sigma_e^2)$), where $I$ is an identity matrix, and $X_1$, $X_2$, $M_1$, $M_2$, $Z_1$ and $Z_2$ are incidence matrices of 1s and 0s relating $y$ to $\beta$, $n$ and $d$, $\alpha_1$, $\alpha_2$, $g_1$ and $g_2$, respectively. Given the genetic structure of the hybrids and to maximize the number of markers used for association mapping, each heterotic group was analyzed separately, with either $M_1\alpha_1$ or $M_2\alpha_2$ in the model. This model differs from Parisseaux and Bernardo (2004) in the treatment of markers as random effects as well as including


environment and density by environment variables in the model as random effects. Previous

studies have treated environmental effects as fixed, but here they are assumed to be random

because environment is considered a random effect in NCII ANOVA analysis. The model for no

QTL is identical to equation (2) except that marker terms are removed:

$$y = X_1\beta + X_2 n + X_2 d + Z_1 g_1 + Z_2 g_2 + e \quad (3)$$

Log likelihood (LogL) estimates for the models were computed using ASReml 3.0, using a

restricted maximum likelihood (REML) approach (Gilmour et al. 2009). The likelihood ratio test

(LRT) statistic was computed using:

$$LRT = -2(\log L_2 - \log L_1)$$

P-values for the test statistics at each SNP were calculated using a $\chi^2$ distribution (1 df). P-values were adjusted for multiple testing using the positive false discovery rate (FDR) in PROC MULTTEST (SAS version 9.4, SAS Inst. Inc., Cary, NC). The predicted values of the observed phenotype ($\hat{y}$) from model (3) were considered the adjusted phenotype. The adjusted phenotype was modelled with all significant SNPs fit simultaneously as fixed effects using PROC GLM. The R-squared for this model was used as an approximation for the phenotypic variance explained by all the significant SNPs. The preceding mixed model analysis was also conducted using pedigree-based relationship matrices in place of the G matrices.
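The testing steps can be sketched in a few lines of Python. The sketch assumes that $\log L_2$ is the log likelihood of the model without a marker term (equation 3) and $\log L_1$ that of the model with it (equation 2), which is the ordering that gives a non-negative statistic. It also substitutes the Benjamini-Hochberg adjustment for the positive FDR computed by PROC MULTTEST, so its q-values are an approximation rather than a reproduction of the SAS output. All function names and the toy p-values are hypothetical.

```python
from math import erfc, sqrt


def lrt_statistic(logl_reduced, logl_full):
    """LRT = -2 (logL_reduced - logL_full)."""
    return -2.0 * (logl_reduced - logl_full)


def chi2_1df_pvalue(lrt):
    """Upper-tail probability of a chi-square variable with 1 df.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(max(lrt, 0.0) / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# LRT values of the kind produced by comparing models (2) and (3)
for lrt in (24.2, 20.78, 13.68):
    print(lrt, chi2_1df_pvalue(lrt))

# Toy p-values (hypothetical) to show the adjustment
print(benjamini_hochberg([0.001, 0.5, 0.02, 0.03]))
```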

4.4 RESULTS AND DISCUSSION

Complete analysis of variance (ANOVA) of grain yield was done previously (Holtrop

2016). Briefly, density, genotype, environment and all the interactions were significant sources

of variation. General productivity (i.e., grain yield across all environments and densities) was

controlled by mostly non-additive genetic effects. Density tolerance and yield potential,


however, were due to mostly additive genetic variation found exclusively among the Stiff Stalk

inbred lines. The Iodent inbred lines used in the study did not exhibit significant additive genetic

variation for grain yield when grown at different plant densities. The broad-sense heritability ($H^2$)

estimate for grain yield was $H^2$ = 0.23, which is consistent with $H^2$ estimates in other studies:

0.07 to 0.37 (Badu-Apraku 2010), 0.19 (Silva et al. 2013), and 0.13 to 0.24 (Hallauer et al.

2010b). Based on the NCII ANOVA analysis, the Stiff Stalk inbred lines exhibited a significant

density by additive genetic variation interaction, while the Iodent lines did not exhibit significant

additive genetic variation for either general productivity or in response to changes in plant

density. Given these initial observations, it was hypothesized that QTLs would only be detected

within the Stiff Stalk inbred lines.

4.4.1 Genomic relationship matrices

For each heterotic group, genomic relationship matrices were constructed to describe the

variance-covariance structure of the polygenic terms in the model. The G matrices for Stiff Stalk

and Iodent lines are presented in Tables 4.2 and 4.3. The off-diagonal elements reflect the

similarity between the pair of lines, based on the shared alleles between the lines, as compared to

the allelic frequency in the population. Large positive coefficients indicate high similarity

between that pair of lines. Negative coefficients can be interpreted as negative correlations. A

negative coefficient implies that the two individuals are more dissimilar than the average pair of

individuals in the dataset. Diagonal elements in the G matrix represent the degree of inbreeding,

which reflects the probability that two genes in the individual are identical by descent. In the

tabular method of pedigree-based relationships, a diagonal value of 1 is used if parents are

unrelated. Larger coefficients indicate high homozygosity in the individual.


To contrast the genomic relationship values with traditionally used pedigree-based

values, the pedigree records were used to construct pedigree-based relationships for Stiff Stalk

(Table 4.4) and Iodent (Table 4.5). The pedigree records of the germplasm in NCII indicate high

degrees of relatedness between many of the lines. Comparison of pedigree-based relationship

values with the G matrix shows that these genomic relationship values are consistently lower

than values obtained from pedigree records. Five of the Stiff Stalk lines are full siblings, derived from the same hybrid, with a pedigree-based relationship coefficient of 0.5. The genomic relationship values among the five full siblings range from -0.05 to 0.22. A set of three Stiff Stalk siblings (CG118, CG119, CG120) is derived from a three-way cross, in which one parent, CG102, and one grandparent, CG65, are also used in the NCII, with relationship coefficients of 0.5 and 0.25, respectively. The genomic values among the three siblings range from 0.02 to 0.30. The genomic relationship of the siblings with CG102 ranges from 0.22 to 0.35, while their relationship with CG65 ranges from -0.31 to -0.09. For the Iodent inbreds, there are two sets of full siblings: CG44 and CG85, and CG60 and CG108. Full sibling pedigree relationship values are 0.5. The genomic relationship between CG44 and CG85 is 0.14, and the genomic relationship between CG60 and CG108 is 0.13 (Table 4.3). These sets of siblings both have PHJ40 as a parent, making them half-siblings, and also both have PH207 as a grandparent on the other side of the pedigree, resulting in a pedigree-based relationship coefficient of 0.31 between the sets. The genomic relationships between the two sets of siblings range from 0 to 0.09.


Table 4.2. GBS-based relationship matrix for the Stiff Stalk inbred lines.

CG37 CG38 CG57 CG79 CG65 CG102 CG118 CG119 CG120 PHG71 PHJ40

CG37 1.004

CG38 0.104 0.754

CG57 0.018 0.163 1.599

CG79 0.223 0.05 0.089 0.925

CG65 0.142 -0.026 -0.049 0.104 1.577

CG102 -0.300 -0.244 -0.342 -0.286 -0.386 1.662

CG118 -0.254 -0.235 -0.241 -0.229 -0.24 0.222 1.635

CG119 -0.299 -0.224 -0.27 -0.277 -0.309 0.349 0.298 1.181

CG120 -0.201 -0.236 -0.294 -0.224 -0.091 0.256 0.019 0.061 1.594

PHG71 -0.449 -0.448 -0.235 -0.344 0.022 0.097 0.215 0.156 0.003 1.534

PHJ40 -0.060 -0.217 -0.083 -0.193 -0.117 -0.085 -0.175 -0.221 -0.103 -0.218 2.901

Table 4.3. GBS-based relationship matrix for the Iodent inbred lines.

CG44 CG85 CG60 CG108 PH207 PHG72 PHG50 PHG83 PHG29 PHK42

CG44 0.812

CG85 0.135 2.259

CG60 -0.002 0.049 1.115

CG108 0.029 0.088 0.134 0.987

PH207 -0.256 -0.472 -0.278 -0.227 0.927

PHG72 0.045 -0.075 -0.216 -0.229 0.088 1.332

PHG50 -0.090 -0.041 0.012 -0.075 -0.389 -0.238 2.878

PHG83 -0.192 -0.214 -0.143 -0.128 -0.011 -0.232 -0.121 1.800

PHG29 -0.275 -0.459 -0.284 -0.260 0.584 -0.002 -0.348 0.014 1.025

PHK42 -0.237 -0.384 -0.272 -0.238 0.459 0.006 -0.313 -0.095 0.477 1.084


Table 4.4. Pedigree-based relationship matrix for the Stiff Stalk inbred lines.

CG37 CG38 CG57 CG79 CG65 CG102 CG118 CG119 CG120 PHG71 PHJ40

CG37 1

CG38 0.5 1

CG57 0.5 0.5 1

CG79 0.5 0.5 0.5 1

CG65 0.5 0.5 0.5 0.5 1

CG102 0 0 0 0 0 1

CG118 0.125 0.125 0.125 0.125 0.25 0.5 1

CG119 0.125 0.125 0.125 0.125 0.25 0.5 0.5 1

CG120 0.125 0.125 0.125 0.125 0.25 0.5 0.5 0.5 1

PHG71 0 0 0 0 0 0 0 0 0 1

PHJ40 0 0 0 0 0 0 0 0 0 0 1

Table 4.5. Pedigree-based relationship matrix for the Iodent inbred lines.

CG44 CG85 CG60 CG108 PH207 PHG72 PHG50 PHG83 PHG29 PHK42

CG44 1

CG85 0.5 1

CG60 0.31 0.31 1

CG108 0.31 0.31 0.5 1

PH207 0.25 0.25 0.25 0.25 1

PHG72 0.5 0.5 0.125 0.125 0.5 1

PHG50 0.125 0.125 0.125 0.125 0.5 0.25 1

PHG83 0.125 0.125 0.125 0.125 0.5 0.25 0.25 1

PHG29 0.19 0.19 0.19 0.19 0.75 0.38 0.38 0.38 1

PHK42 0.19 0.19 0.19 0.19 0.75 0.38 0.38 0.38 0.5 1


The pedigree-based and genomic relationship coefficients presented here differ due to the

base populations used to estimate population allele frequencies. The pedigree-based method

assumes a base population of all the maize lines, with population allele frequencies reflective of

the entire maize population. In contrast, the genomic relationship method uses the lines in NCII

as the base population. For both the Stiff Stalk and Iodent groups, the “populations” are very

small and consist of highly related lines, which are not reflective of the maize population as a

whole. The calculation of the genomic relationships compares the alleles shared between a pair

of lines to the frequency of the alleles in the population. Calculating relationships using these

very small populations, consisting of one heterotic pattern only, results in values that are smaller

than pedigree-based estimates, close to zero, or even negative.

To demonstrate the influence of the base population on genomic relationship values, the

genomic relationships between inbred lines were calculated using a large, diverse base

population, similar to what is assumed using pedigree estimates. Several of these inbred lines

used in the NCII were included in a genomic relationship analysis of nearly 1,200 maize lines

including public and expired proprietary germplasm from Stiff Stalk, Iodent and non-Stiff Stalk

heterotic patterns (see Chapter 3). In this analysis, the genomic relationship of CG65 with CG57

(expected 0.5) was 0.34, CG65 with CG118 and CG120 (expected 0.25) were 0.18 and 0.29

respectively, CG102 with CG118 and CG120 (expected 0.5) were 0.44 and 0.47 respectively,

and CG118 with CG120 (expected 0.5) was 0.32. These examples show that calculating the

genomic relationships using a large, diverse base population results in values that are comparable

to pedigree-based estimates. The genomic relationship matrices used in this research were derived using the small populations, so their relationship coefficients reflect the similarity between these lines in the context of the small population analyzed.
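The dependence on the base population can be seen at a single locus. The sketch below evaluates a one-SNP version of a VanRaden-style coefficient, (x_i − 2p)(x_j − 2p) / (2p(1 − p)), for the same pair of genotypes under different assumed allele frequencies. The function name, the frequencies and the convention that counts refer to one fixed allele (not necessarily the minor one) are assumptions chosen for illustration; a full G matrix sums the numerators and denominators over all SNPs instead of taking the ratio locus by locus.

```python
def locus_term(count_i, count_j, freq):
    """One-locus relationship term for two lines with allele counts 0, 1 or 2.

    freq is the frequency of the counted allele in the base population.
    """
    centred_i = count_i - 2 * freq
    centred_j = count_j - 2 * freq
    return centred_i * centred_j / (2 * freq * (1 - freq))


# Two lines that are both homozygous for the same allele
shared_common = locus_term(2, 2, freq=0.9)  # allele nearly fixed in a small, related panel
shared_rare = locus_term(2, 2, freq=0.3)    # same allele less frequent in a diverse panel

# Two lines homozygous for opposite alleles
opposite = locus_term(2, 0, freq=0.5)

print(shared_common, shared_rare, opposite)
```

Sharing an allele that nearly every line in the base population carries adds little to the coefficient, whereas sharing an allele that is uncommon in the base population adds a great deal. This is consistent with the lower values obtained when closely related lines serve as their own base population.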


4.4.2 QTL detection using genomic and pedigree-based matrices

This mapping experiment was conducted twice, using genomic and then pedigree-based

matrices to describe the polygenic variance-covariance structure in the model, representing the

relationships between individuals. For the Iodent germplasm, neither approach detected any significant SNP associations with grain yield at different planting densities. Significant associations were detected in the Stiff Stalk germplasm using both approaches, although the number of significant loci differed (Figure 4.1). The scatter plots of LRT values (Figure 4.1) show “levels” of markers with the same LRT value, which correspond to the same patterns of alleles across the inbred lines for markers within the same level (Table 4.6). While the LRT levels, and

the markers within each level, are identical between the two approaches, the number of

significant markers differs. The G matrix approach identified 123 significant associations, while

the pedigree-based relationship matrices generated an additional 224 significant associations

(Table 4.7). The SNP loci significantly associated with grain yield at different planting densities

together explain approximately 9.38% of the phenotypic variance (adjusted for environmental

effects), using the results of either the G or pedigree-based matrices.
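The link between LRT levels and allele patterns follows from the fact that two SNPs with identical genotype vectors across the inbred lines enter the model identically and therefore yield the same test statistic. A minimal sketch of how such groups could be identified is given below. The SNP identifiers are hypothetical; the genotype vectors reuse the Level 1, Level 2 and Level 4 patterns reported in Table 4.6.

```python
from collections import defaultdict

LINES = ["CG102", "CG118", "CG119", "CG120", "CG37", "CG38",
         "CG57", "CG65", "CG79", "PHG71", "PHJ40"]

# Hypothetical SNP names; minor-allele counts follow the order of LINES
genotypes_by_snp = {
    "snp_a": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 1),  # Level 1 pattern
    "snp_b": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2),  # Level 2 pattern
    "snp_c": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2),  # Level 2 pattern
    "snp_d": (0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 0),  # Level 4 pattern
}


def group_by_pattern(snps):
    """Group SNP identifiers that share an identical genotype vector."""
    groups = defaultdict(list)
    for snp_id, counts in snps.items():
        groups[tuple(counts)].append(snp_id)
    return dict(groups)


groups = group_by_pattern(genotypes_by_snp)
for pattern, members in groups.items():
    print(len(members), members, pattern)
```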


Figure 4.1. Scatter plot of Likelihood Ratio Test statistic (LRT) values for Stiff Stalk markers for

additive effects for grain yield at different planting densities using the G matrix (top) and

pedigree matrix (bottom). Lines indicate FDR adjusted q-value < 0.05.


Table 4.6. The levels of LRT values observed in the scatter plot (Figure 4.1) correspond to

different patterns of alleles across the inbred lines. Level 1 indicates the highest LRT value. The

numbers for the alleles indicate the count of minor alleles (i.e. 0 = homozygous major allele, 1 =

heterozygous, 2 = homozygous minor allele).

Alleles

Inbred line Level 1 Level 2 Level 3 Level 4

CG102 0 0 0 0

CG118 0 0 0 0

CG119 0 0 0 0

CG120 0 0 0 0

CG37 0 0 0 0

CG38 2 2 2 2

CG57 2 2 2 2

CG65 0 0 0 0

CG79 0 0 0 0

PHG71 2 2 1 2

PHJ40 1 2 2 0

Table 4.7. Mapping with the pedigree-based matrix results in more markers with significant associations than mapping with the genomic relationship matrix. Likelihood ratio test (LRT) statistics and corresponding adjusted p-value (q-value) for SNPs in the different LRT levels are shown. Q-values < 0.05 are indicated with a *.

                             Genomic matrix        Pedigree-based matrix
LRT level   Number of SNPs   LRT      q-value      LRT      q-value
1           3                24.2     0.0012 *     24.98    0.0005 *
2           118              20.78    0.0012 *     22.5     0.0005 *
3           2                13.68    0.0478 *     15       0.0169 *
4           224              11.56    0.0541       13.7     0.0169 *
5           1                9.26     0.1857       11.44    0.0563

The observation of markers occupying distinct levels of LRT values in the scatter plots

(Figure 4.1), as opposed to markers being spread over many LRT values, is likely due to the

population structure of the germplasm, in which there are limited combinations of alleles

between individuals. This limited number of allele combinations may be due to the small sample size,

inbreeding and high level of relatedness of the lines used. The genotypes of markers with


significant associations were cross-referenced with the grain yield performance of each inbred

line at the three planting densities (Table 4.8). For low density (37 k ha$^{-1}$), there appeared to be no effect of genotype on grain yield. However, at commercial and high density (74 k ha$^{-1}$ and 148 k ha$^{-1}$, respectively), the lines homozygous for the major allele had positive $g_i$ values (with the exception of CG79 at 148 k ha$^{-1}$), while lines that were heterozygous or homozygous for the minor allele had negative $g_i$ values. This pattern suggests that homozygosity for the major allele,

at these detected loci, confers favourable density tolerance at commercial and high densities.

Table 4.8. For markers in the top three LRT levels, homozygosity for the major allele is

associated with high grain yield at commercial (74 k ha-1) and high (148 k ha-1) population

densities. The inbred lines are ordered according to the number of minor alleles at loci in the top

3 LRT levels (see Table 4.6). Estimates of grain yield, expressed as gi, are shown for each inbred

line for each planting density (Holtrop 2016). These estimates reflect the difference between the

average yield of all progeny of a parental line and the average yield of all hybrids grown in an

environment.

Inbred line          Allele in top    gi (Mg ha-1)
                     3 LRT levels     37,000 ha-1   74,000 ha-1   148,000 ha-1
CG102                0                -0.41         0.59          0.79
CG118                0                0.33          0.53          0.31
CG119                0                -0.02         0.32          0.52
CG120                0                -0.05         0.31          0.80
CG37                 0                -0.10         0.11          0.56
CG65                 0                0.36          0.07          0.26
CG79                 0                0.37          0.23          -0.07
PHJ40                2/1              -0.17         -0.17         -0.51
PHG71                2/1              -0.07         -0.48         -0.90
CG38                 2                -0.18         -0.62         -0.72
CG57                 2                -0.06         -0.90         -1.03
Grand Mean                            7.99          9.95          9.04
LSD gi (0.05)                         0.24          0.24          0.24
LSD gi - gj (0.05)                    0.35          0.35          0.35
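The contrast described for Table 4.8 can be summarised by averaging the gi estimates within each genotype class. The sketch below does this with the values from the table. The two-class split (no minor alleles versus at least one) and the function name are choices made for this illustration and are not an analysis reported in the thesis.

```python
from statistics import mean

# (line, allele class in top 3 LRT levels, gi at 37,000, 74,000 and 148,000 ha-1)
TABLE_4_8 = [
    ("CG102", "0", -0.41, 0.59, 0.79),
    ("CG118", "0", 0.33, 0.53, 0.31),
    ("CG119", "0", -0.02, 0.32, 0.52),
    ("CG120", "0", -0.05, 0.31, 0.80),
    ("CG37", "0", -0.10, 0.11, 0.56),
    ("CG65", "0", 0.36, 0.07, 0.26),
    ("CG79", "0", 0.37, 0.23, -0.07),
    ("PHJ40", "2/1", -0.17, -0.17, -0.51),
    ("PHG71", "2/1", -0.07, -0.48, -0.90),
    ("CG38", "2", -0.18, -0.62, -0.72),
    ("CG57", "2", -0.06, -0.90, -1.03),
]


def mean_gi_by_class(rows):
    """Mean gi per density for lines with no minor alleles versus any minor allele."""
    classes = {
        "major homozygote": lambda allele: allele == "0",
        "carries minor allele": lambda allele: allele != "0",
    }
    summary = {}
    for label, keep in classes.items():
        subset = [r for r in rows if keep(r[1])]
        summary[label] = tuple(mean(r[k] for r in subset) for k in (2, 3, 4))
    return summary


summary = mean_gi_by_class(TABLE_4_8)
for label, (low, commercial, high) in summary.items():
    print(f"{label}: {low:+.2f} {commercial:+.2f} {high:+.2f}")
```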


In this study, the G and pedigree-based approaches for generating the relationship matrix

produced very similar QTL mapping results. The two approaches generated slightly different

LRT values, with this difference resulting in different q-values for the markers (Table 4.7). The

G matrix approach was more stringent, with a smaller number of significantly associated loci

detected. The LogL values generated by ASReml, describing the fit of the model to the data,

were nearly identical between the two approaches, with the G matrix (LogL = -4302.30) having a

slightly better fit than the pedigree-based matrix (LogL = -4302.48). With the increasing availability

of genetic data, the use of markers to generate a relationship matrix provides an alternative to the

traditional approach of using pedigree records. This research suggests that in cases where

pedigree records are limited or unknown, the G matrix is a suitable replacement, generating very

similar results to the pedigree-based matrix.

It has been proposed that deriving the relationship matrix from genetic data actually

produces more accurate values than using pedigree-based estimates, because marker data can

account for Mendelian sampling, giving more precise estimates than pedigree records (Hayes et

al. 2009; Pryce et al. 2012), and because pedigree records commonly contain errors, generating

inaccurate relationship values (VanRaden 2008). In the present study, the use of a G matrix

conferred little advantage over the pedigree-based matrix in terms of the fit of the model but did

generate a more stringent output of significant SNP associations. Despite the differences in the

relationship coefficients produced from the two methods, the use of a G matrix over the

pedigree-based matrix did not drastically alter the results, which may be because the marker

alleles explained a larger amount of the phenotypic variance than either relationship matrix. The

relative importance of the relationship matrix in explaining the phenotypic variance, compared

with other terms in the model, may be affected by the germplasm used or the phenotypic trait


analyzed. A direction for future research is to compare the mapping results of G and pedigree-

based matrices for an NCII using different germplasm or a different phenotypic trait. The

following sections of this paper focus on the results from the G matrix analysis.

4.4.3 Informative SNPs used for in silico mapping

In this mixed model approach, the heterotic groups were treated separately for testing

marker associations. Filtering resulted in 27,398 SNPs for Stiff Stalk and 32,401 for Iodent used

for testing marker associations, which are distributed across all the maize chromosomes (Figure

4.2). Of the SNPs passing filtering in each heterotic group, only 11,383 SNPs were common to

both groups. The large number of markers unique to each group may be due to the

presence/absence variation in the maize genome, in which some genomic regions may be present

only in one heterotic group, or due to uncalled (missing) GBS data in one of the groups.
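The counts of group-specific markers follow directly from the reported totals, and the same bookkeeping can be done on marker identifiers with set operations. In the sketch below the three totals are taken from the text, while the function name and the SNP identifiers in the toy sets are hypothetical.

```python
def overlap_summary(n_group1, n_group2, n_shared):
    """Counts of markers unique to each group and in the union of both."""
    return {
        "unique_to_group1": n_group1 - n_shared,
        "unique_to_group2": n_group2 - n_shared,
        "union": n_group1 + n_group2 - n_shared,
    }


# Totals reported for Stiff Stalk, Iodent and the SNPs common to both
counts = overlap_summary(27398, 32401, 11383)
print(counts)

# The same logic on (hypothetical) marker identifiers
stiff_stalk = {"S1_100", "S1_250", "S3_075", "S10_410"}
iodent = {"S1_250", "S3_075", "S5_330"}
shared = stiff_stalk & iodent          # usable for additive by additive tests
only_stiff_stalk = stiff_stalk - iodent
only_iodent = iodent - stiff_stalk
print(sorted(shared), sorted(only_stiff_stalk), sorted(only_iodent))
```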

Testing marker associations in the groups separately allowed for the testing of many

more associations than if only shared SNPs were tested. The drawback of using SNPs unique to

one heterotic pattern is that additive by additive effects cannot be modelled and tested. In this

study, the 11,383 SNPs were used to test for additive by additive effects, but no significant

associations were detected. It is proposed that the SNPs that are unique to a heterotic pattern are

in fact the most useful, rather than the ones that are shared. There may be certain traits or QTLs

that have been selected for in the different heterotic groups independently, since these groups of

germplasm have been kept genetically distinct in proprietary breeding programs.



Figure 4.2. Ideogram illustrating the genome coverage of the SNP markers used for in silico

mapping. Black bands show the position of (A) 27,398 SNPs in Stiff Stalk inbred lines and (B)

32,401 SNPs in Iodent inbred lines.

4.4.4 QTL detection and NCII

The outcome of the NCII ANOVA (Holtrop 2016) indicated that Stiff Stalk inbred lines exhibited a significant density by additive genetic variation interaction but did not have significant additive effects for general productivity. There was no significant additive genetic

variation in Iodent lines for general productivity or for density by additive genetic variation

interaction. Consistent with the ANOVA results, this QTL mapping study detected significant

associations for grain yield at different planting densities in Stiff Stalk inbreds, but did not detect

any significant associations in Iodent lines.

This study demonstrates the use of NCII for QTL mapping, which allows mapping results

to be cross-referenced with the results of NCII ANOVA. Significant associations were detected

in the Stiff Stalk background using the mixed model approach, but not in the Iodent background,


despite a large number of SNPs used to test marker associations (Figure 4.2B). While it is not

unusual for a QTL to be detected in one heterotic pattern only, the absence of detectable QTLs in

a heterotic pattern has not been reported in previous studies (Parisseaux and Bernardo 2004; van

Eeuwijk et al. 2010). It appears that the Iodent germplasm was not optimal for QTL mapping for

grain yield, considering the lack of significant additive genetic variation reported in the NCII

ANOVA. Additionally, the average performance of half of the Iodent lines was not significantly

different than the overall average for grain yield (Holtrop 2016), which greatly reduces the

opportunity to identify loci related to above and below average yields. By using NCII germplasm

for a QTL mapping experiment, the NCII ANOVA can be used as a screening tool to determine

if there is sufficient additive genetic variation in the germplasm to warrant QTL mapping.

The results of the NCII ANOVA indicated that the only source of additive genetic variation

for grain yield was the Stiff Stalk by density interaction. Given this, the detected QTLs are not

QTLs for general productivity but instead are QTLs for the interaction of planting density and

grain yield. In this study, the detected grain yield QTLs in Stiff Stalk germplasm are more

accurately described as QTLs for additive genetic effects for the impact of planting density on

grain yield. In this way, the use of NCII ANOVA results to detect the source of the additive

genetic variation in the germplasm facilitates a greater understanding of QTLs detected in the

QTL mapping experiment.

4.4.5 QTLs detected in Stiff Stalk

The mixed model approach with the G matrix identified 123 SNPs in Stiff Stalk

associated with additive effects for grain yield that together explain approximately 9.38% of the

phenotypic variance, adjusted for environmental effects. These SNPs are located on five

chromosomes (Figure 4.3), with the positions of the significantly associated SNPs listed in


Supplemental Table 1. The SNPs with significant associations are located in bins 1.04, 1.10,

3.02, 3.04, 3.05, 3.08, 5.03, 9.02, 10.02, 10.03, 10.04, 10.05, using bins reported by Maize GDB

(Andorf et al. 2010).

Figure 4.3. Ideogram showing the chromosomal locations of the SNPs that were significantly

associated with regions influencing grain yield in the Stiff Stalk inbred lines.

It is expected that yield, a quantitative trait, would be controlled by many QTLs, each

with small effects. When mapping with elite germplasm, large effect QTL are unlikely to be

segregating in the population, since major favourable alleles are likely already fixed. Considering

the close genetic relationships among the Stiff Stalk lines, there are likely large linkage

disequilibrium (LD) blocks within the population. Detected SNPs that are found in close

proximity could therefore be referred to more accurately as detected LD regions, assuming that

the detected SNPs within an LD block are linked to the underlying QTL(s) within the LD block.


The detected SNPs, likely reflecting LD regions, are located in bins on chromosomes 1

(bins 1.04, 1.10), 3 (3.02, 3.04, 3.05, 3.08), 5 (5.03), 9 (9.02) and 10 (10.02, 10.03, 10.04, 10.05).

Maize grain yield QTLs have been previously reported in bins 1.04 (Austin and Lee 1996;

Messmer et al. 2009), 1.10 (Ribaut et al. 1997; Melchinger et al. 1998), 5.03 (Beavis et al. 1994;

Melchinger et al. 1998; Nikolić et al. 2012), 10.02 (Kozumplik et al. 1996), 10.03 (Stuber et al.

1992; Ribaut et al. 1997), and 10.04 (Stuber et al. 1992; Ajmone-Marsan et al. 1995; Ajmone-Marsan et al. 1996; Melchinger et al. 1998), according to Maize GDB (Andorf et al. 2010). This

research appears to be the first to report QTLs for grain yield in bins 9.02, 10.05 and the bins

detected on chromosome 3. Validation of QTLs through comparison with other studies can be

difficult due to differences in germplasm and markers used.

These SNPs with significant associations are represented by 56 gene models, 8 of which

have putative functions, according to Maize GDB (Andorf et al. 2010). These gene models

include six transcription factors (outer cell layer1, c2c2-gata TF31, c2h2 TF235, NAC TF67,

MYB TF134, bHLH TF153), ATPase1 and glutathione transporter1 (Supplemental Table 1).

These genes function in plant development as well as in response to biotic and abiotic stress.

While these SNPs only have a statistical association with grain yield, and not a biological one,

transcription factors involved in abiotic and biotic stress response are likely candidates to

influence grain yield.

4.4.6 Discussion of the mixed model approach

This paper demonstrated association mapping for the complex trait of grain yield using a

small sample size of 110 lines. Previous in silico mapping studies have used larger sample sizes

of 404 inbred lines (Zhang et al. 2005), 1,700 hybrids (van Eeuwijk et al. 2010) and 22,774

hybrids (Parisseaux and Bernardo 2004). This experiment was expected to have lower power to


detect QTLs than previous studies due to a smaller sample size and a large number of QTL

underlying the trait (Yu et al. 2005). However, high marker coverage can

increase the power of QTL detection (Yu et al. 2005). The present study demonstrates that QTL

mapping for a complex trait, using a small sample size, can be achieved using the high marker

coverage of GBS data.

The method presented here for mapping with a NCII population structure is applicable to

other maize breeding programs as well as any crop utilizing heterosis. The mixed model

approach is flexible and can be used for mapping multiple phenotypic traits simultaneously

(Malosetti et al. 2008) or for a multi-QTL analysis using a Bayesian approach (van Eeuwijk et al.

2010). In addition, QTL by environment effects can be investigated by including data from

weather stations (Boer et al. 2007). While this mixed model included only additive effects,

mixed models can include a dominance term, as demonstrated by Yu et al. (2005), for studies

investigating the mechanism of heterosis or the prediction of hybrid performance. This approach

is applicable to any breeding program using NCII, to increase the efficiency of QTL detection in

their elite germplasm, and is applicable for any phenotypic trait of interest in the hybrid

germplasm.

4.6 CONCLUSIONS

This paper describes a mixed model approach for in silico mapping in maize with NCII

using GBS data. Using in silico mapping with a G matrix to describe relationships between lines,

123 SNPs associated with approximately 12 genomic regions showing additive effects for grain

yield in elite Stiff Stalk germplasm were identified. Collectively, these regions explained

approximately 9.38% of the phenotypic variance observed in grain yield. Several of these

genomic regions are novel, including QTLs in bins 9.02, 10.05 and four bins on chromosome 3.


Mapping using both pedigree-based and GBS-derived relationship matrices demonstrated that

these relationship matrices can be interchanged for similar results. The NCII ANOVA, showing

no significant additive effects in Iodent lines, but showing significant density by additive

variation in Stiff Stalk lines gives strength to the in silico mapping results. The approach

described here illustrates the integration of QTL mapping with maize breeding program data

using a common mating design and the utility of a GBS relationship matrix.


CHAPTER 5: GENERAL DISCUSSION

5.1 APPLICATIONS OF GENOTYPING BY SEQUENCING DATA IN MAIZE

BREEDING PROGRAMS

This thesis describes several methods of GBS data analysis using data from the

University of Guelph maize breeding program, in conjunction with publicly available data.

This research compared the effectiveness of a newly described method, the network diagram, to

the commonly used hierarchical clustering method for analyzing population structure. The

network diagram was shown to be more effective for generating U.S. field corn heterotic patterns

than hierarchical clustering using Ward’s method. While hierarchical clustering will place all

lines in the dendrogram with no reflection of the strength of the connection, the network diagram

allows for lines to be placed outside of or between key clusters and can identify the lines at the

core of each cluster. This research expanded on the work of Romay et al. (2013) by demonstrating

that this method is effective for large datasets (nearly 1,200 lines) and datasets including ex-PVP,

public and University of Guelph germplasm. This study also demonstrated that GBS data has

applications for identifying close relatives of maize lines using hierarchical clustering, and

described several outcomes of this approach. Finally, GBS data was used to determine the extent

of parental contribution as well as to identify parental genome segments in progeny derived from

breeding crosses. The identification of parental genome segments in progeny from inter-heterotic

crosses can be used to investigate the nature of heterotic patterns and heterosis. This method can

also be applied to intra-heterotic crosses used in the breeding program, to determine favourable

linkage blocks that have been selected in many progeny.
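The idea of estimating parental contribution and locating parental genome segments can be illustrated with a minimal sketch. It assumes inbred lines with homozygous calls coded 0 or 2, a hypothetical `MISSING` code for no-calls, and markers already ordered along one chromosome; the function names are illustrative and this is not the procedure used in the thesis.

```python
from itertools import groupby

MISSING = None  # hypothetical code for a missing GBS call


def marker_origins(progeny, parent_a, parent_b):
    """Label each marker 'A', 'B', or None (missing, uninformative, or unresolved)."""
    origins = []
    for g, a, b in zip(progeny, parent_a, parent_b):
        if MISSING in (g, a, b) or a == b:
            origins.append(None)  # cannot tell the parents apart at this marker
        elif g == a:
            origins.append("A")
        elif g == b:
            origins.append("B")
        else:
            origins.append(None)  # e.g. a heterozygous call
    return origins


def parental_contribution(origins):
    """Fraction of informative markers that match parent A."""
    informative = [o for o in origins if o is not None]
    if not informative:
        return None
    return informative.count("A") / len(informative)


def parental_segments(origins):
    """Collapse runs of informative markers with the same origin into
    (parent, first_marker_index, last_marker_index) tuples."""
    informative = [(i, o) for i, o in enumerate(origins) if o is not None]
    segments = []
    for parent, run in groupby(informative, key=lambda item: item[1]):
        run = list(run)
        segments.append((parent, run[0][0], run[-1][0]))
    return segments


# Toy example with six markers on one chromosome.
parent_a = [0, 0, 2, 2, 0, 2]
parent_b = [2, 2, 0, 0, 0, 0]
progeny = [0, 0, 0, MISSING, 0, 0]

origins = marker_origins(progeny, parent_a, parent_b)
print(parental_contribution(origins))
print(parental_segments(origins))
```

Skipping missing and monomorphic markers keeps the estimate tied to markers that actually distinguish the two parents, at the cost of coarser segment boundaries where GBS calls are sparse.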

With the large amount of genotypic data becoming available, methods to apply this data

to benefit maize breeding programs are lacking. This thesis meets that need by describing

methods that are suitable for any maize population to benefit a breeding program and also serve


as a foundation for further development of methods to analyze GBS data. As computational

methods become available, it is important to evaluate the success of new approaches against the

old, more common methods of analysis. The discovery that the network diagram was superior to

the common method of hierarchical clustering using Ward’s method, in terms of reflecting maize

heterotic patterns, demonstrates the importance of evaluating new methods as they are

developed. The techniques described in this thesis will allow researchers to apply GBS data to

maize breeding programs in novel ways, facilitating a greater understanding of maize germplasm

as well as benefiting breeding programs.
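The contrast between a network diagram and a dendrogram can be sketched in a few lines. The example below assumes a symmetric genetic similarity matrix and a simple similarity threshold for drawing links; the threshold rule, the line names and the function names are assumptions for illustration and do not reproduce the thesis's network construction.

```python
def similarity_network(names, similarity, threshold):
    """Link two lines whenever their pairwise similarity meets the threshold."""
    links = {name: set() for name in names}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i][j] >= threshold:
                links[names[i]].add(names[j])
                links[names[j]].add(names[i])
    return links


def clusters(links):
    """Group lines into connected components of the network."""
    seen, groups = set(), []
    for start in links:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(links[node] - group)
        seen |= group
        groups.append(group)
    return groups


def core_line(group, links):
    """Return the line with the most links in a cluster (ties broken by name)."""
    return max(sorted(group), key=lambda name: len(links[name]))


# Hypothetical lines: three Stiff Stalk-like, two Iodent-like, one unrelated.
names = ["SS1", "SS2", "SS3", "IO1", "IO2", "X1"]
similarity = [
    [1.0, 0.8, 0.7, 0.1, 0.1, 0.1],
    [0.8, 1.0, 0.4, 0.1, 0.1, 0.1],
    [0.7, 0.4, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 1.0],
]

links = similarity_network(names, similarity, threshold=0.5)
for group in clusters(links):
    print(sorted(group), "core:", core_line(group, links))
```

Unlike a dendrogram, which must attach every line to some branch, the line `X1` remains unlinked here, and the number of links per line gives a simple indication of which lines sit at the core of a cluster.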

5.2 IN SILICO MAPPING OF MAIZE GRAIN YIELD QTLS USING STRUCTURED

CROSSES

In this thesis, a mixed model approach was used for in silico mapping in maize with NCII

using GBS data. With the genomic relationship matrix, this study identified 123 significant

associations for additive genetic effects for impact of planting density on grain yield in elite Stiff

Stalk germplasm lines. Underlying QTLs together explained approximately 9.38% of the

phenotypic variance, adjusted for environmental effects. Several of these genomic regions are

novel, including QTLs in bins 9.02, 10.05 and four bins on chromosome 3. While previous

reports of in silico mapping utilized pedigree records to construct a relationship matrix for each

heterotic group, this study showed that GBS data can be used to create a genomic relationship

matrix that can be interchanged with the pedigree-based relationship matrix with nearly identical results. This

suggests that the relationship matrix can be made using GBS data alone, which is especially

beneficial for sets of germplasm with unknown or partially missing pedigree records.
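As an illustration of how a relationship matrix can be built from marker data alone, the sketch below follows the first method of VanRaden (2008): with genotypes coded as allele counts, $G = ZZ' / (2\sum_k p_k(1 - p_k))$, where $Z$ holds the counts centred on twice the allele frequency $p_k$. Setting missing calls to the marker mean is an assumption made here for simplicity, and the function name is hypothetical; the exact construction used in this thesis may differ.

```python
def genomic_relationship_matrix(genotypes):
    """Compute a VanRaden-style genomic relationship matrix.

    genotypes: one list per line, holding allele counts (0, 1, 2) per marker,
    or None where the call is missing.
    """
    n_markers = len(genotypes[0])

    # Allele frequency per marker, estimated from the non-missing calls.
    freqs = []
    for k in range(n_markers):
        calls = [line[k] for line in genotypes if line[k] is not None]
        freqs.append(sum(calls) / (2 * len(calls)) if calls else 0.0)

    # Centre each call on 2p; a missing call contributes 0 (mean imputation).
    z = [
        [0.0 if line[k] is None else line[k] - 2 * freqs[k] for k in range(n_markers)]
        for line in genotypes
    ]

    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("no polymorphic markers")

    return [
        [sum(zi[k] * zj[k] for k in range(n_markers)) / scale for zj in z]
        for zi in z
    ]


# Four hypothetical inbred lines scored at three markers.
genotypes = [
    [0, 2, 2],
    [0, 2, 0],
    [2, 0, 0],
    [2, 0, None],
]
for row in genomic_relationship_matrix(genotypes):
    print([round(value, 2) for value in row])
```

Because the matrix depends only on marker calls, it can be computed for germplasm with unknown or incomplete pedigree records, although a high rate of missing calls will pull the affected entries toward zero under this imputation rule.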


While previous studies have conducted in silico association mapping using hybrid data,

this study was the first to investigate QTL mapping using structured crosses. This allowed for

NCII ANOVA results to be cross-referenced with the results of the QTL mapping experiment.

The observed lack of QTLs detected in Iodent lines was consistent with the NCII ANOVA

analysis indicating that there were no significant additive effects for yield in the Iodent

background. The detection of significant associations in the Stiff Stalk germplasm lines was

consistent with the analysis reporting significant additive by density effects for yield. These

results suggest that an NCII ANOVA can be used as a screening tool, identifying germplasm

with insufficient additive genetic variation prior to conducting the QTL mapping experiment,

thus increasing the efficiency of the research.

Since the detected SNP markers are likely to be linked to each other and the underlying

QTL(s), they are more accurately described as detected LD regions. The next step with this

research is to construct the haplotypes of the Stiff Stalk germplasm, to identify the LD regions

containing the detected SNPs. This task requires ancestral pedigree information for the Stiff

Stalk lines, which is lacking in this study, as well as marker data for these ancestral lines. A

limitation of using GBS data is the prevalence of missing data, as missing data in one or both of

the parents at a locus could pose a challenge in the generation of haplotypes.

This study identified SNPs, likely belonging to LD regions, linked to QTLs for additive

genetic effects for impact of planting density on grain yield. Since one of the key traits maize

breeders have selected for in the past 100 years is density tolerance, it is reasonable to assume

that the maize genome contains signatures of selection, or selective sweeps, for density tolerance.

If signatures of selection for density tolerance have been detected in maize, these could be cross-

referenced with the results of this study to validate the genomic regions detected in this study.


The decreasing cost of marker data and increased efficiency of computational analysis

allow for novel methods of basic and applied genetic research to be developed. As an alternative

to traditional QTL mapping approaches, this thesis demonstrates an approach with wide

applications for any breeding program utilizing NCII designs, allowing breeders to increase the

efficiency of their breeding programs by integrating QTL mapping with existing phenotypic data

from structured crosses. This method allows for cost-effective QTL discovery in elite germplasm

with results that are directly applicable to the breeding program.


REFERENCES

Ajmone-Marsan, P., G. Monfredini, A. Brandolini, A.E. Melchinger, G. Garay and M. Motto.

1996. Identification of QTL for grain yield in an elite hybrid of maize: Repeatability of

map position and effects in independent samples derived from the same population.

Maydica 41:49-57.

Ajmone-Marsan, P., G. Monfredini, W.F. Ludwig, A.E. Melchinger, P. Franceschini, G.

Pagnotto, and M. Motto. 1995. In an elite cross of maize a major quantitative trait locus

controls one-fourth of the genetic variation for grain yield. Theoretical Applied Genetics

90:415-424.

Andorf, C.M., C.J. Lawrence, L.C. Harper, M.L. Schaeffer, D.A. Campbell, and T.Z. Sen. 2010.

The Locus Lookup tool at MaizeGDB: identification of genomic regions in maize by

integrating sequence information with physical and genetic maps. Bioinformatics 26:

434-436.

Austin, D.F. and M. Lee. 1996. Comparative mapping in F-2:3 and F-6:7 generations of

quantitative trait loci for grain yield and yield components in maize. Theoretical Applied

Genetics 92:817-826.

Badu-Apraku, B. 2010. Effects of recurrent selection for grain yield and Striga resistance in an

extra-early maize population. Crop Science 50:1735–1743.

Barata, C., and M.J. Carena. 2006. Classification of North Dakota maize inbred lines into

heterotic groups based on molecular and testcross data. Euphytica 151:339–349.

Bastian, M., S. Heymann, and M. Jacomy. 2009. Gephi: an open source software for

exploring and manipulating networks. International AAAI Conference on Weblogs and

Social Media.

Beavis, W.D., O.S. Smith, D. Grant, and R. Fincher. 1994. Identification of quantitative trait loci

using a small sample of topcrossed and F4 progeny from maize. Crop Science 34:882-

896.

Bernardo, R. 2001. Breeding potential of intra- and interheterotic group crosses in maize. Crop

Science 41:68–71.

Bernardo, R., J. Romero-Severson, J. Ziegle, J. Hauser, L. Joe, G. Hookstra, and R.W. Doerge.

2000. Parental contribution and coefficient of coancestry among maize inbreds: pedigree,

RFLP, and SSR data. Theoretical Applied Genetics 100:552–556.

Bernardo, R., and J. Yu. 2007. Prospects for genomewide selection for quantitative traits in

maize. Crop Science 47:1082–1090.


Bink, M.C.A.M., L.R. Totir, C.J.F. ter Braak, C.R. Winkler, M.P. Boer, and O.S. Smith. 2012.

QTL linkage analysis of connected populations using ancestral marker and pedigree

information. Theoretical and Applied Genetics 124: 1097–1113.

Birchler, J.A., D.L. Auger, and N.C. Riddle. 2003. In search of the molecular basis of heterosis.

The Plant Cell 15:2236-2239.

Boer, M.P., D. Wright, L.Z. Feng, D.W. Podlich, L. Luo, M. Cooper, and F.A. van Eeuwijk.

2007. A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial

data using environmental covariables for QTL-by-environment interactions, with an

example in maize. Genetics 177:1801–1813.

Bradbury, P.J., Z. Zhang, D.E. Kroon, T.M. Casstevens, Y. Ramdoss, and E.S. Buckler. 2007.

TASSEL: software for association mapping of complex traits in diverse samples.

Bioinformatics 23:2633-2635.

Bradley, J.P., K.H. Knittle, and A.F. Troyer. 1988. Statistical methods in seed corn production

selection. Journal of Production Agriculture 1: 34-38.

Bu, S.H., X. Zhao, C. Yi, J. Wen, J. Tu, and Y.M. Zhang. 2015. Interacted QTL mapping in

partial NCII design provides evidences for breeding by design. PLoS ONE 10:

e0121034.

Burt, A.J., C.M. Grainger, M.P. Smid, B.J. Shelp, and E.A. Lee. 2011. Allele mining in exotic

maize germplasm to enhance macular carotenoids. Crop Science 51:991-1004.

Cadzow, M., J. Boocock, H.T. Nguyen, P. Wilcox, T.R. Merriman, and M.A. Black. 2014. A

bioinformatics workflow for detecting signatures of selection in genomic data. Frontiers

in Genetics 5:293-300.

Civardi, L., Y. Xia, K.J. Edwards, P.S. Schnable, and B.J. Nikolau. 1994. The relationship

between genetic and physical distances in the cloned a1-sh2 interval of the Zea mays L.

genome. PNAS 91:8268-8272.

Comstock, R.E., and H.F. Robinson. 1952. Estimation of average dominance of genes. p. 494-

516. In J.W. Gowen (ed.) Heterosis. Iowa State College Press, Ames.

Crepieux, S., C. Lebreton, B. Servin, and G. Charmet. 2004. Quantitative trait loci (QTL)

detection in multicross inbred designs: recovering QTL identical-by-descent status

information from marker data. Genetics 168: 1737–1749.

Crossa, J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño, J.L. Araus, D. Makumbi, R. P.


Singh, S. Dreisigacker, J. Yan, V. Arief, M. Banziger, and H.-J. Braun. 2010. Prediction

of genetic values of quantitative traits in plant breeding using pedigree and molecular

markers. Genetics 186: 713–724.

Crossa, J., Y. Beyene, S. Kassa, P. Pérez, J.M. Hickey, C. Chen, G. de los Campos, J. Burgueño,

V.S. Windhausen, E. Buckler, J.-L. Jannink, M.A. Lopez Cruz, and R. Babu. 2013.

Genomic prediction in maize breeding populations with Genotyping-by-Sequencing. G3

(Bethesda) 3:1903-1926.

Crow, J.F. 1998. 90 Years ago: The beginning of hybrid maize. Genetics 148: 923–928.

Danecek, P., A. Auton, G. Abecasis, C.A. Albers, E. Banks, M.A. DePristo, R. Handsaker, G.

Lunter, G. Marth, S.T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project

Analysis Group. 2011. The Variant Call Format and VCFtools. Bioinformatics 27: 2156-

2158.

Darrah, L.L. and M.S. Zuber. 1986. 1985 United States farm maize germplasm base and

commercial breeding strategies. Crop Science 26:1109–1113.

de Koning, D.J., R. Pong-Wong, L. Varona, G.J. Evans, E. Giuffra, A. Sanchez, G. Plastow, J.L.

Noguera, L. Andersson, and C.S. Haley. 2003. Full pedigree quantitative trait locus analysis

in commercial pigs using variance components. Journal of Animal Science 81: 2155–

2163.

Duvick, D.N. 2001. Biotechnology in the 1930s: The development of hybrid maize. Nature

Reviews Genetics 2: 69–74.

Duvick, D.N. 2005. The contribution of breeding to yield advances in maize (Zea mays L.).

Advances in Agronomy 86: 83–145.

East, E.M. 1909. Inbreeding in corn, 1907. In: Connecticut Agric Exp Stn Rep, pp 419–428

Elshire R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S. Buckler, and S.E. Mitchell.

2011. A robust, simple Genotyping-by-Sequencing (GBS) approach for high diversity

species. PLoS One 6:e19379.

Emik, L.O., and C.E. Terrill. 1949. Systematic procedures for calculating inbreeding

coefficients. Journal of Heredity 40: 51-55.

FAO. Food and Agriculture Organization of the United Nations statistics division.

Production/Crops. Latest update: 2015. Accessed January 2016. url:

http://faostat3.fao.org/home/E


George, A.W., P.M. Visscher, and C.S. Haley. 2000. Mapping quantitative trait loci in complex

pedigrees: a two-step variance component approach. Genetics 156: 2081–2092.

Gilmour, A.R., B.J. Gogel, B.R. Cullis, and R. Thompson. 2009. ASReml User Guide Release

3.0. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK www.vsni.co.uk

Grupe, A., S. Germer, J. Usuka, D. Aud, J. K. Belknap, R.F. Klein, M.K. Ahluwalia, R. Higuchi,

and G. Peltz. 2001. In silico mapping of complex disease-related traits in mice. Science

292: 1915–1918.

Guo, J., Z. Chen, Z. Liu, B. Wang, W. Song, W. Li, J. Chen, J. Dai, and J. Lai. 2011.

Identification of genetic factors affecting plant density response through QTL mapping of

yield component traits in maize (Zea mays L.). Euphytica 182: 409-422.

Hallauer, A., J. Miranda Filho, and M. Carena. 2010a. Quantitative genetics in maize breeding.

3rd ed. Iowa State Univ. Press, Ames, IA.

Hallauer, A.R., M.J. Carena, and J.M. Filho. 2010b. Hereditary variance: Experimental

estimates. In: Quantitative Genetics in Maize Breeding. Springer, New York, pp. 169-

222.

Hansey, C.N., J.M. Johnson, R.S. Sekhon, S.M. Kaeppler, and N. de Leon. 2011. Genetic

diversity of a maize association population with restricted phenology. Crop Science

51:704–715.

Hansey, C.N., B. Vaillancourt, R.S. Sekhon, N. de Leon, S.M. Kaeppler, and C.R. Buell. 2012.

Maize (Zea mays L.) genome diversity as revealed by RNA-sequencing. PLoS ONE

7:e33071.

Hayes, B.J., P.M Visscher, and M.E. Goddard. 2009. Increased accuracy of artificial selection by

using the realized relationship matrix. Genetics Research 91: 47–60.

Henderson, C.R. 1976. A simple method for computing the inverse of a numerator relationship

matrix used in prediction of breeding values. Biometrics 32: 69-83.

Ho, C., R. McCouch, and E. Smith. 2002. Improvement of hybrid yield by advanced backcross

QTL analysis in elite maize. Theoretical Applied Genetics 105:440-448.

Holtrop, A.T. Genetic architecture for yield potential, density tolerance, and yield stability in

maize (Zea mays L). MSc thesis, University of Guelph, 2016.


Jones, D.F. 1918. The effects of inbreeding and crossbreeding on development. In: Conn Agric

Exp Stn Bull 207, pp 5–100.

Kozumplik, V., I. Pejic, L. Senior, R. Pavlina, G. Graham, and C.W. Stuber. 1996. Use of

molecular markers for QTL detection in segregating maize populations derived from

exotic germplasm. Maydica 41: 211-217.

Lee, E.A., and L.W. Kannenberg. 2004. Effect of inbreeding method and selection criteria on

inbred and hybrid performance. Maydica 49:191-197.

Lee, E.A. and M. Tollenaar. 2007. Physiological basis of successful breeding strategies for

maize grain yield. Crop Science 47:S202-S215.

Lee, E.A., M.J. Ash, and B. Good. 2007. Re-examining the relationship between degree of

relatedness, genetic effects and heterosis in maize. Crop Science 47:629-635.

Lee, E.A., A. Singh, M.J. Ash, and B. Good. 2006. Use of sister-lines and the performance of

modified single-cross maize hybrids. Crop Science 46:312-320.

Li, C., Y. Li, P.J. Bradbury, X. Wu, Y. Shi, Y. Song, D. Zhang, E. Rodgers-Melnick, E.S.

Buckler, Z. Zhang, Y. Li, and T. Wang. 2015a. Construction of high-quality

recombination maps with low-coverage genomic sequencing for joint linkage analysis in

maize. BMC Biology 13:78-89.

Li, Y.-X., X. Wu, J. Jaqueth, D. Zhang, D. Cui, C. Li, G. Hu, H. Dong, Y.C. Song, Y.-S. Shi, T.

Wang, B. Li, and Y. Li. 2015b. The identification of two head smut resistance-related

QTL in maize by the joint approach of linkage mapping and association analysis. PLoS

One 10: e0145549.

Liu, K., M. Goodman, S. Muse, J.S. Smith, E. Buckler, and J. Doebley. 2003. Genetic

structure and diversity among maize inbred lines as inferred from DNA microsatellites.

Genetics 165: 2117–2128.

Liu, C., Z. Hao, D. Zhang, C. Xie, M. Li, X. Zhang, H. Yong, S. Zhang, J. Weng, and X. Li.

2015. Genetic properties of 240 maize inbred lines and identity-by-descent segments

revealed by high-density SNP markers. Molecular Breeding 35:146.

Malosetti, M., J.M. Ribaut, M. Vargas, J. Crossa, and F.A. van Eeuwijk. 2008. A multi-trait

multi-environment QTL mixed model with an application to drought and nitrogen stress

trials in maize (Zea mays L.). Euphytica 161:241–257.

Melchinger, A.E., H.F. Utz and C.C. Schön. 1998. Quantitative trait locus (QTL) mapping using

different testers and independent population samples in maize reveals low power of QTL

detection and large bias in estimates of QTL effects. Genetics 149:383-403.


Messmer, R., Y. Fracheboud, M. Banziger, M. Vargas, P. Stamp, and J. Ribaut. 2009. Drought

stress and tropical maize: QTL-by-environment interactions and stability of QTLs across

environments for yield components and secondary traits. Theoretical Applied

Genetics 119:913-930.

Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic values

using genome-wide dense marker maps. Genetics 157:1819–1829.

Mezmouk, S. and J. Ross-Ibarra. 2014. The pattern and distribution of deleterious mutations in

maize. G3 (Bethesda) 4: 163–171.

Mikel, M.A. and J.W. Dudley. 2006. Evolution of North American dent corn from public to

proprietary germplasm. Crop Science 46: 1193-1205.

Mikel, M.A. 2008. Genetic diversity and improvement of contemporary proprietary North

American dent corn. Crop Science 48: 1686-1695.

Montgomery, E.G. 1916. The corn crops. Macmillan, New York.

Moose, S.P., and R.H. Mumm. 2008. Molecular plant breeding as the foundation for 21st

century crop improvement. Plant Physiology 147:969–977.

Mumm, R.H., L.J. Hubert, and J.W. Dudley. 1994. A classification of 148 U.S. maize inbreds: II.

Validation of cluster analysis based on RFLPs. Crop Science 34:852-865.

Nelson, P.T., N.D. Coles, J.B. Holland, D.M. Bubeck, S. Smith, and M.M. Goodman. 2008.

Molecular characterization of maize inbreds with expired U.S. Plant Variety Protection.

Crop Science 48:1673-1685.

Nelson, B.K., A.L. Kahler, J.L. Kahler, M.A. Mikel, S.A. Thompson, R.S. Ferriss, S. Smith, and

E.S. Jones. 2011. Evaluation of the numbers of single nucleotide polymorphisms required

to measure genetic distance in maize (Zea mays L.). Crop Science 51: 1470-1480.

Nikolić, A., D. Ignjatović-Micić, D. Dodig, V. Anđelković, and V. Lazić-Jančić. 2012.

Identification of QTLs for yield and drought-related traits in maize: assessment of their

causal relationships. Biotechnology and Biotechnological Equipment 26:2952-2960.

Parisseaux, B. and R. Bernardo. 2004. In silico mapping of quantitative trait loci in maize.

Theoretical and Applied Genetics 109: 508–514.

Piepho, H.P. 2009. Ridge regression and extensions for genomewide selection in maize. Crop

Science 49: 1165–1176.


Pryce, J.E., B.J. Hayes, and M. E. Goddard. 2012. Novel strategies to minimize progeny

inbreeding while maximizing genetic gain using genomic information. Journal of Dairy

Science 95: 377–388.

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P.

Sklar, P.I.W. De Bakker, M.J. Daly, and P.C. Sham. 2007. PLINK: a toolset for whole-

genome association and population-based linkage analysis. American Journal of Human

Genetics 81:559-575.

R Core Team. 2015. R: A language and environment for statistical computing. R Foundation for

Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Ribaut, J.-M., C. Jiang, D. Gonzalez-de-Leon, G. O. Edmeades, and D. A. Hoisington. 1997.

Identification of quantitative trait loci under drought conditions in tropical maize. 2.

Yield components and marker-assisted selection strategies. Theoretical Applied Genetics

94:887-896.

Rojas, B.A. and G.F. Sprague. 1952. A comparison of variance components in corn yield trials:

III. General and specific combining ability and their interaction with locations and years.

Agronomy Journal 44:462-466.

Romay, M.C., R.A. Malvar, L. Campo, A. Álvarez, J. Moreno-González, A. Ordás, and P.

Revilla. 2010. Climatic and genotypic effects for grain yield in maize under stress

conditions. Crop Science 50: 51–58.

Romay, M.C., M.J. Millard, J.C. Glaubitz, J.A. Peiffer, K.L. Swarts, T.M. Casstevens, R.J.

Elshire, C.B. Acharya, S.E. Mitchell, S.A. Flint-Garcia, M.D. McMullen, J.B. Holland,

E.S. Buckler, and C.A. Gardner. 2013. Comprehensive genotyping of the USA national

maize inbred seed bank. Genome Biology 14:R55.

SAS/STAT software, Version 9.4 of the SAS System for Unix. Copyright © 2002-2012 SAS

Institute Inc.

Shull, G.H. 1908. The composition of a field of maize. Amer Breeders’ Assoc Rep 4:296–301

Shull, G.H. 1909. A pureline method of corn breeding. Amer Breeders’ Assoc Rep 5:51–59

Silva, F.F.E., J.M.S Viana, V.R. Faria, and M.D.V. de Resende. 2013. Bayesian inference of

mixed models in quantitative genetics of crop species. Theoretical and Applied Genetics

126:1749–1761.

Stuber, C.W., S.E. Lincoln, D.W. Wolff, T. Helentjaris, and E.S. Lander. 1992. Identification of


genetic factors contributing to heterosis in a hybrid from two elite maize inbred lines

using molecular markers. Genetics 132:823-839.

Tanksley, S.D. 1993. Mapping polygenes. Annual Review of Genetics 27: 205–233.

Tollenaar, M. and E.A. Lee. 2002. Yield potential, yield stability and stress tolerance in maize.

Field Crops Research 75: 161-169.

Troyer, A.F. 1999. Background of U.S. hybrid corn. Crop Science 39: 601–626.

Troyer, A.F. and M.A. Mikel. 2010. Minnesota corn breeding history: Department of Agronomy

& Plant Genetics Centennial. Crop Science 50: 1141–1150.

van Eeuwijk, F.A., M. Boer, L.R. Totir, M. Bink, D. Wright, C.R. Winkler, D. Podlich, K.

Boldman, A. Baumgarten, M. Smalley, M. Arbelbide, C.J.F. ter Braak, and M. Cooper.

2010. Mixed model approaches for the identification of QTLs within a maize hybrid

breeding program. Theoretical and Applied Genetics 120: 429-440.

VanRaden, P.M. 2008. Efficient methods to compute genomic predictions. Journal of Dairy

Science, 91: 4414-4423.

Wolfe, D., S. Dudek, M.D. Ritchie, and S.A. Pendergrass. 2013. Visualizing genomic

information across chromosomes with PhenoGram. BioData Mining 6:18-29.

Wu, Y., F.S. Vicente, K. Huang, T. Dhliwayo, D.E. Costich, K. Semagn, N. Sudha, M. Olsen,

B.M. Prasanna, X. Zhang, and R. Babu. 2016. Molecular characterization of CIMMYT

maize inbred lines with genotyping‑by‑sequencing SNPs. Theoretical Applied Genetics

129:753–765.

Yu, J., M. Arbelbide, and R. Bernardo. 2005. Power of in silico QTL mapping from phenotypic,

pedigree, and marker data in a hybrid breeding program. Theoretical and Applied

Genetics 110: 1061–1067.

Zhang, Y.-M., Y. Mao, C. Xie, H. Smith, L. Luo, and S. Xu. 2005. Mapping quantitative trait

loci using naturally occurring genetic variance among commercial inbred lines of maize

(Zea mays L.). Genetics 169: 2267–2275.

Zhang, X., P. Pérez-Rodríguez, K. Semagn, Y. Beyene, R. Babu, M.A. López Cruz, F. San

Vicente, M. Olsen, E. Buckler, J.L. Jannink, B.M. Prasanna, and J. Crossa. 2015.

Genomic prediction in biparental tropical maize populations in water-stressed and well-

watered environments using low-density and GBS SNPs. Heredity 114:291–299.


Zhou, Z., C. Zhang, Y. Zhou, Z. Hao, Z. Wang, X. Zeng, H. Di, M. Li, D. Zhang, H. Yong, S.

Zhang, J. Weng, and X. Li. 2016. Genetic dissection of maize plant architecture with an

ultra-high density bin map based on recombinant inbred lines. BMC Genomics 17: 178-

192.

Zhu, C., M. Gore, E.S. Buckler, and J. Yu. 2008. Status and prospects of association mapping in

plants. The Plant Genome 1:5–20.


SUPPLEMENTAL TABLES AND FIGURES FOR CHAPTER 4: IN SILICO MAPPING

OF MAIZE GRAIN YIELD QTLS USING STRUCTURED CROSSES

Supplementary Table 4.1. Position, Likelihood ratio test statistic (LRT) and FDR adjusted q-

value for significant SNPs found in Stiff Stalk germplasm. Gene models containing a SNP are

listed.

Chromosome; Position (bp, according to RefGen v2); LRT; FDR q-value; Gene models from MaizeGDB

1 66,975,722 20.7 0.0012

1 67,077,524 20.7 0.0012

1 67,440,505 20.7 0.0012

1 67,464,758 20.7 0.0012

1 67,480,447 20.7 0.0012

1 67,993,826 20.7 0.0012

1 68,439,791 20.7 0.0012

1 68,439,797 20.7 0.0012

1 68,561,821 20.7 0.0012

1 70,247,320 20.7 0.0012

1 72,022,649 20.7 0.0012

1 276,818,314 20.7 0.0012 GRMZM2G421491/ZEAMMB73_922706 (gsht1 - glutathione transporter1)

3 6,976,782 20.7 0.0012

3 18,184,709 24.12 0.0012

3 18,999,799 20.7 0.0012

3 20,400,067 20.7 0.0012

3 20,417,559 20.7 0.0012

3 20,417,562 20.7 0.0012

3 21,400,755 20.7 0.0012

3 22,250,501 20.7 0.0012

3 22,570,789 20.7 0.0012

3 22,944,442 20.7 0.0012

3 23,009,927 20.7 0.0012

3 23,009,945 20.7 0.0012

3 23,009,951 20.7 0.0012

3 23,012,358 20.7 0.0012

3 26,308,887 20.7 0.0012

3 27,552,196 20.7 0.0012 GRMZM2G026643/ZEAMMB73_817367 (ocl1 - outer cell layer1)


3 27,552,199 20.7 0.0012

3 141,359,601 20.7 0.0012 GRMZM2G101020/ZEAMMB73_429199 (atp1 - ATPase1)

3 141,444,450 20.7 0.0012 GRMZM2G067171/ZEAMMB73_582636 (gata31 - C2C2-GATA-transcription factor 31)

3 142,253,006 20.7 0.0012

3 142,261,549 20.7 0.0012

3 147,292,600 20.7 0.0012

3 147,292,623 20.7 0.0012

3 147,292,637 20.7 0.0012

3 148,291,030 20.7 0.0012

3 148,291,283 20.7 0.0012

3 153,303,132 20.7 0.0012

3 153,764,074 20.7 0.0012

3 154,124,779 20.7 0.0012

3 154,373,563 20.7 0.0012

3 154,373,583 20.7 0.0012

3 154,373,594 20.7 0.0012

3 156,275,421 20.7 0.0012

3 157,085,834 20.7 0.0012

3 211,719,240 20.7 0.0012

5 18,276,882 20.7 0.0012

5 18,276,955 20.7 0.0012

5 18,276,959 20.7 0.0012

5 18,276,960 20.7 0.0012

9 18,702,410 20.7 0.0012

9 18,765,811 20.7 0.0012

10 10,138,359 20.7 0.0012 GRMZM2G068710/ZEAMMB73_417749 (c2h35 - C2H2-transcription factor 235)

10 10,138,363 20.7 0.0012

10 10,170,505 20.7 0.0012

10 10,170,516 20.7 0.0012

10 10,196,528 13.62 0.0478

10 10,196,543 20.7 0.0012

10 10,201,182 20.7 0.0012

10 10,201,183 20.7 0.0012

10 10,201,193 20.7 0.0012


10 10,245,870 13.62 0.0478

10 10,603,533 20.7 0.0012

10 10,615,471 20.7 0.0012

10 10,620,573 20.7 0.0012

10 10,667,000 20.7 0.0012

10 10,667,383 20.7 0.0012

10 10,829,394 20.7 0.0012

10 10,829,400 20.7 0.0012

10 14,001,763 20.7 0.0012

10 14,445,038 20.7 0.0012 GRMZM2G083347/ZEAMMB73_391358 (nactf67 - NAC-transcription factor 67)

10 14,446,091 20.7 0.0012

10 14,446,149 20.7 0.0012

10 14,764,622 20.7 0.0012

10 14,908,455 24.12 0.0012

10 15,208,871 20.7 0.0012

10 15,208,872 20.7 0.0012

10 15,237,302 20.7 0.0012

10 15,393,034 20.7 0.0012

10 15,549,495 24.12 0.0012

10 16,031,237 20.7 0.0012

10 16,031,241 20.7 0.0012

10 16,213,666 20.7 0.0012

10 16,354,706 20.7 0.0012

10 16,354,729 20.7 0.0012

10 16,372,589 20.7 0.0012

10 16,372,595 20.7 0.0012

10 79,940,553 20.7 0.0012

10 79,987,079 20.7 0.0012

10 80,042,615 20.7 0.0012

10 80,042,635 20.7 0.0012

10 80,048,193 20.7 0.0012

10 80,492,532 20.7 0.0012

10 80,576,147 20.7 0.0012

10 83,456,644 20.7 0.0012

10 83,456,689 20.7 0.0012


10 85,394,188 20.7 0.0012 GRMZM2G001824/ZEAMMB73_891377 (myb134 - MYB-transcription factor 134)

10 86,822,065 20.7 0.0012

10 87,419,786 20.7 0.0012 GRMZM2G036554/ZEAMMB73_653283 (bhlh153 - bHLH-transcription factor 153)

10 91,960,025 20.7 0.0012

10 91,960,027 20.7 0.0012

10 91,960,032 20.7 0.0012

10 92,033,761 20.7 0.0012

10 93,140,324 20.7 0.0012

10 93,140,350 20.7 0.0012

10 93,140,365 20.7 0.0012

10 95,232,647 20.7 0.0012

10 95,232,657 20.7 0.0012

10 97,372,975 20.7 0.0012

10 104,831,710 20.7 0.0012

10 105,281,679 20.7 0.0012

10 110,205,731 20.7 0.0012

10 110,205,733 20.7 0.0012

10 110,298,066 20.7 0.0012

10 110,298,193 20.7 0.0012

10 110,298,195 20.7 0.0012

10 111,426,216 20.7 0.0012

10 132,245,294 20.7 0.0012

10 132,533,341 20.7 0.0012

10 132,533,404 20.7 0.0012

10 132,533,405 20.7 0.0012

10 132,533,406 20.7 0.0012