www.sciencemag.org/cgi/content/full/345/6194/1249721/suppl/DC1

Supplementary Materials for

Structural and functional partitioning of bread wheat chromosome 3B

Frédéric Choulet,* Adriana Alberti, Sébastien Theil, Natasha Glover, Valérie Barbe, Josquin Daron, Lise Pingault, Pierre Sourdille, Arnaud Couloux, Etienne Paux,

Philippe Leroy, Sophie Mangenot, Nicolas Guilhot, Jacques Le Gouis, Francois Balfourier, Michael Alaux, Véronique Jamilloux, Julie Poulain,

Céline Durand, Arnaud Bellec, Christine Gaspin, Jan Safar, Jaroslav Dolezel, Jane Rogers, Klaas Vandepoele, Jean-Marc Aury, Klaus Mayer, Hélène Berges,

Hadi Quesneville, Patrick Wincker, Catherine Feuillet

*Corresponding author. E-mail: [email protected]

Published 18 July 2014, Science 345, 1249721 (2014)

DOI: 10.1126/science.1249721

This PDF file includes:

Materials and Methods

Figs. S1 to S11

Tables S1 to S14

References

Supplementary text

Table of contents:

1. Sequencing, assembly, scaffolding, curation, and construction of a pseudomolecule
1.1. Sequencing
1.1.1. BAC DNA extractions and pool preparation
1.1.2. Roche/454 paired-end library preparation and sequencing
1.1.3. Illumina sequencing of flow-sorted chromosome 3B
1.2. Read assembly and scaffolding
1.3. Scaffolding curation
1.4. Assignment of scaffolds to physical contigs
1.5. Removal of redundant sequences
1.6. Construction of pseudomolecule from chromosome 3B
1.6.1. SNP marker development via SureSelect Target Enrichment
1.6.2. Ordering and orientating scaffolds along the chromosome
2. Sequence annotation
2.1. Gene annotation using an improved version of the TriAnnot pipeline
2.1.1. Training ab-initio predictors
2.1.2. Gene modeling and definition of gene islands
2.1.3. Classification of genes, pseudogenes, and gene fragments
2.1.4. Content in protein-coding genes and estimation of the prediction accuracy
2.1.5. Using gene models to estimate the quality of wheat whole genome and chromosome-based shotgun assemblies
2.1.6. GO term annotation and enrichment analysis
2.1.7. Identification of genes putatively involved in resistance to pathogens
2.1.8. Non-coding RNA gene predictions
2.2. TE annotation
2.2.1. Classification of a library of Triticeae transposable element sequences for similarity-based annotation
2.2.2. Development of ClariTE for high quality automated annotation of TEs
2.2.3. Solo LTR annotation
2.2.4. De novo repeat identification
2.3. Distribution of TEs
2.3.1. Definition of the centromere
2.3.2. Pattern of TE distribution
2.3.3. Relative proportion of TE in the vicinity of genes
2.4. TE insertion time
3. Expression analyses
3.1. Sample preparation and sequencing
3.2. Read mapping, expression analysis, and detection of alternative splicing
3.3. Segmentation/change-point analysis
4. Chromosome partitioning in maize and barley
4.1. Partitioning of the maize chromosomes
4.2. Partitioning of the barley chromosomes
5. Comparative analyses
5.1. Identification of syntenic and nonsyntenic genes on chromosome 3B
5.2. Collinearity
5.3. Intra- and Inter-chromosomal duplications
5.4. Calculation of synonymous (Ks) and nonsynonymous (Ka) substitution rates
6. Construction of a genetic map and LD mapping
6.1. Plant material and DNA extraction
6.2. Genetic mapping
6.3. Linkage Disequilibrium (LD) mapping
7. metaQTL analysis and projection of the confidence intervals on the 3B pseudomolecule

1. Sequencing, assembly, scaffolding, curation, and construction of a pseudomolecule

1.1. Sequencing

1.1.1. BAC DNA extractions and pool preparation

We used the latest version of the 3B physical map (4) to select a minimal tiling path (MTP) for

sequencing. To avoid redundancy, we discarded all contigs containing fewer than 5 BACs, as

they mostly correspond to contigs with low-quality fingerprints that carry regions already

present in larger contigs. This resulted in an MTP of 8,452 BAC clones assembled into 1,282

BAC contigs. DNA from all 8,452 BAC clones was extracted individually by an adapted

alkaline lysis method that limited contamination by E. coli genomic DNA to around

10%. Extractions were performed in 96-well plates. Inoculation and DNA extraction were

repeated 12 times for each clone using 12 independent plates in order to reach a sufficient

amount of DNA for downstream library preparation. Each BAC clone was incubated in 1.2 ml

2YT (+ 12.5 µg/ml chloramphenicol) in a 96-well plate for 20 hours at 37°C with shaking at 525 rpm. After

centrifugation for 10 min at 4°C, the pellets were resuspended in 200 µl of pre-chilled P1 buffer

(Qiagen, Hilden, Germany) containing RNase (50 µg/ml). Then, 200 µl of P2 buffer (Qiagen)

was added and the plate contents were mixed by gently inverting the plate five times. Plates were

left on ice for 4 minutes, and 200 µl of pre-chilled P3 buffer (Qiagen) was added. After gentle mixing

by five inversions, the plates were incubated on ice for 10 minutes and then centrifuged for 30 min

at 4100 rpm at 4°C. The supernatant was transferred to clean plates, and 350 µl

of isopropanol was added and gently mixed before centrifugation for 30 min at 4000 rpm at

room temperature. Pellets were washed twice with 400 µl of 70% ethanol, vacuum dried, and

dissolved in 25 µl of TE buffer (4:0.2). One 96-well plate was used for BAC-end sequencing by the

Sanger method in order to verify the position of each BAC on the plates and to obtain BAC-end

sequence (BES) information for subsequent sequence assembly. Following DNA extraction,

the contents of the wells corresponding to the same BAC clone were pooled. After

pooling, each extraction was quantified using a Fluoroskan microplate fluorometer.

BAC pools were created as follows using information from the MTP: equimolar amounts of

up to 10 different BACs, belonging or not to the same physical contig, were mixed to create

DNA pools containing at least 15 µg of DNA. In total, 922 BAC pools were created while

optimizing the pooling of overlapping BACs.

1.1.2. Roche/454 paired-end library preparation and sequencing

The BAC pools were used to construct 922 mate-pair libraries of 8 kb insert size following a

slightly modified Roche/454 protocol. Briefly, 15 µg of each BAC pool were sheared to about

8 kb, end repaired with the END-it-Repair kit (Epicentre), and ligated to biotinylated loxP

adaptors (Roche). After gel size selection of 8 kb bands and fill-in, 300 ng of DNA was

circularized by Cre recombinase, and the remaining linear DNA was digested with

Plasmid-Safe ATP-dependent DNase (Epicentre) and exonuclease I. Circular DNA was fragmented by

Covaris (Covaris Inc., USA) shearing, and biotinylated fragments were immobilized on

streptavidin beads. As the original Roche/454 protocol did not support barcoded mate-pair

libraries, modified barcoded adapters were used for subsequent ligation, and the

library was then prepared following the Roche/454 protocol without further modifications. After

library quantification by qPCR, emulsion PCRs were performed on pools of three libraries.

Six libraries were then loaded on one PicoTiterPlate (PTP) and pyrosequenced using the GS FLX Titanium

instrument (Roche) according to the manufacturer's protocol, in order to obtain at least 20 Mb

of sequence per pool.

1.1.3. Illumina sequencing of flow-sorted chromosome 3B

Approximately 20,000 copies of chromosome 3B were flow-sorted in three batches. Their

DNA was purified and amplified by multiple displacement amplification (MDA) using the illustra GenomiPhi V2

DNA Amplification Kit (GE Healthcare, Piscataway, USA) according to Simkova et al. (58).

DNA samples obtained in three independent amplifications were pooled and 1 µg of DNA

was sonicated to a 150- to 600-bp size range using the E210 Covaris instrument (Covaris,

Inc., USA). Fragments were end-repaired and 3’-adenylated before Illumina adapters were

added using the NEBNext Sample Reagent Set (New England Biolabs). Ligation products of

300-600 bp were gel-purified and size-selected. DNA fragments were PCR-amplified using

Illumina adapter-specific primers. After library profile analysis by the Agilent 2100

Bioanalyzer (Agilent Technologies, USA) and qPCR quantification, the library was

sequenced using 100-base read chemistry on a paired-end flow cell on the Illumina

HiSeq2000 (Illumina, USA). This generated 82 Gb of sequence.

1.2. Read assembly and scaffolding

Given the specificity of the sequencing strategy, an automated assembly pipeline was

developed for the project. First, the pooled BAC sequences were evaluated for quality and cleaned of

contaminating E. coli genomic and vector sequences using the SOAP2 aligner

(59). Then, the sequences from each BAC-pool were assembled into scaffolds using the

Newbler assembler from Roche (version MapAsmResearch-04/19/2010-patch-08/17/2010,

http://www.454.com). BESs produced for all the clones were used to check the presence of

the expected BAC sequence in each pool using BLAST (60).

1.3. Scaffolding curation

The first BAC pool sequence assemblies resulted in 16,136 scaffolds (293,806 contigs) with an

N50 of 275 kb and an average of 12.5 scaffolds per physical contig. However, the contig

N50 was only 12 kb, meaning that, although the scaffolds were large, they comprised many

small contigs because of the high proportion of repeated DNA (mainly long terminal repeats of

retrotransposons). To improve the accuracy of the assembly and scaffolding, we developed a

pipeline based, first, on mining information provided by Newbler about the positions of paired

reads in the contigs to validate and improve the scaffolding computed by Newbler. In a

second step, the pipeline integrates data from the Illumina chromosome 3B shotgun and BAC-

end sequencing. Based on the two output files “454PairStatus.txt”, containing all the

information regarding the read position in the assembly, and “454Scaffolds.txt”, presenting

the contig organization on the scaffold, we developed a program pinpointing potential

scaffolding errors. In addition, a module was developed to provide information to the curator

about the potential insertion of previously unplaced contigs into scaffolds (for gap filling)

based on read pair data. Finally, the decision to correct the scaffolding was taken by a curator

after viewing and inspecting the assembly. A corrected version of the file describing

scaffold organization (“454Scaffolds.txt”) was produced by the curator and used to

automatically generate a corrected sequence.
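
The scaffold and contig N50 values quoted in this section can be computed from a list of sequence lengths. The following is a minimal Python sketch assuming the usual definition of N50 (the length L such that sequences of length L or more cover at least half of the total); the function name and the toy lengths are illustrative and are not taken from the project pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed (standard) definition: the largest length L such that
    sequences of length >= L together cover at least half of the total.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with hypothetical lengths in kb (not data from the study).
toy_scaffolds = [500, 300, 275, 100, 50, 25]
print(n50(toy_scaffolds))
```

In this toy set the total is 1,250 kb, so the N50 is reached at the second-longest sequence.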

The second step aimed to ensure sequence accuracy, given the high error

rate at homopolymer sites observed with pyrosequencing technology. In addition, manual

curation did not reduce the number of gaps in the assembly. To address these two

issues, we used the 82 Gb of Illumina reads produced from DNA of flow-sorted chromosome

3B. First, we carried out consensus sequence correction using a homemade program (61)

that maps reads onto the reference 3B sequence using BWA (Burrows-Wheeler Aligner,

http://bio-bwa.sourceforge.net/), detects variations based on the read quality and the

coverage at each nucleotide position, and finally corrects the reference sequence. Then, using

the reads mapped close to gaps in the reference sequence, we used GapCloser

(http://soap.genomics.org.cn/) to extend sequence edges, estimate gap sizes more precisely,

or fully fill gaps when possible. In total, 126,290 bases were corrected and 109,914

gaps were filled. These finishing steps resulted in a total of 5,109 scaffolds with an N50 of

463 kb and an average of 4 scaffolds per physical contig (Table S1).
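
The idea of correcting a reference from read quality and per-position coverage can be illustrated with a simplified sketch. This is not the program cited as (61): it handles substitutions only (homopolymer errors are typically insertions or deletions), and the thresholds `min_coverage`, `min_fraction`, and `min_quality`, as well as the pileup data structure, are assumptions chosen for illustration.

```python
from collections import Counter


def correct_reference(reference, pileup, min_coverage=10,
                      min_fraction=0.8, min_quality=20):
    """Majority-vote substitution correction (illustrative sketch only).

    reference: reference sequence as a string.
    pileup: dict mapping a 0-based position to a list of
            (base, phred_quality) observations from mapped reads.
    The three thresholds are hypothetical values, not from the study.
    Returns the corrected sequence and the number of corrected bases.
    """
    corrected = list(reference)
    n_changes = 0
    for pos, observations in pileup.items():
        # Keep only base calls of sufficient quality.
        bases = [base for base, quality in observations
                 if quality >= min_quality]
        if len(bases) < min_coverage:
            continue  # not enough reliable coverage to decide
        base, count = Counter(bases).most_common(1)[0]
        if base != reference[pos] and count / len(bases) >= min_fraction:
            corrected[pos] = base
            n_changes += 1
    return "".join(corrected), n_changes


# Toy example: only position 2 has enough high-quality support to change.
toy_pileup = {
    2: [("T", 30)] * 12,   # well supported substitution G -> T
    4: [("C", 30)] * 3,    # coverage too low
    0: [("G", 5)] * 20,    # base qualities too low
}
print(correct_reference("ACGTA", toy_pileup))
```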

1.4. Assignment of scaffolds to physical contigs

The 5,109 scaffolds resulting from the finishing steps were used to build a pseudomolecule of

chromosome 3B. At that stage, 110 small scaffolds were discarded because they

potentially originated from bacterial DNA contamination, leaving a set of 4,999 scaffolds

(Table S1). Although the pooling scheme was optimized to maximize the pooling of

overlapping BACs originating from the same physical contig, 477 of the 922 initial pools

(52%) contained BACs originating from more than one physical contig (up to 5). As all reads

from a given BAC-pool shared the same barcode, additional information was required to

assign the assembled scaffolds to their physical contigs of origin. To do so, we searched for

sequence identity between scaffolds and all available sequence tags that were assigned to

individual BACs. Two sources of tags were used: 42,551 non-redundant BESs and 327,282

Whole Genome Profiling (WGP) tags generated by (62) on a subset of the 3B physical map.

This allowed us to assign 96% of the sequence to a contig of the physical map. The remaining

4% correspond to small scaffolds that did not match any BES or tag. Homemade scripts

(available upon request) and a relational MySQL database, connecting the data of the physical

map, tags, pools, and sequences, were developed to retrieve BESs and WGP tags expected to

be present in the assembly of each pool.

1.5. Removal of redundant sequences

Sequencing individual BAC pools raises the problem of sequence redundancy in the assembly

because the same genomic locus can be assembled twice, independently, when it is

contained in overlapping BACs that were sequenced in different pools. This redundancy

needs to be distinguished from truly duplicated regions because it would otherwise limit our ability to

understand the role of duplications in wheat genome evolution. One solution would

have been to perform a single assembly run with the full set of reads from several

BAC pools. However, in that case, the size of the region assembled and the proportion of

repeated k-mers increase, which lowers the accuracy of the assembly. Therefore, we chose

to assemble each pool separately and remove the redundancy afterwards. Removing

redundancy could not be done using any of the available read assemblers, since scaffolds are

large and repeat-rich and contain gaps of variable size, and are therefore not suitable for

existing tools. Thus, we developed a new program called scaffAssembler that uses BLAT

(63) to assemble scaffolds through iterative pairwise alignments. It performs an all-by-all

comparison of a series of scaffolds and then parses the alignments to capture the presence of

redundant sequences. Three different cases were distinguished: 1) identical scaffolds: when

two scaffolds overlap over their full length; 2) included scaffolds: when a scaffold is fully

identical to a part of a larger one in the dataset; and 3) overlapping scaffolds: when two

scaffolds share identical sequences encompassing their extremities. Two scaffolds were

considered to share a redundant sequence when they had at least 10 kb of

contiguous sequence with at least 99% nucleotide identity. When identical scaffolds were

found, one was randomly discarded. For a scaffold fully included in a larger one, the

smaller scaffold was discarded. In the case of overlapping scaffolds, scaffAssembler

computed the assembly by retrieving the coordinates of the matching segments (found by

BLAT with the extendThroughN parameter) and compared the gap content of the two

redundant segments in order to discard the one with the higher number of Ns. This procedure

was applied iteratively by pairwise alignments until every case of redundancy was resolved

within the considered set of scaffolds.
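
The three redundancy cases can be expressed as a small decision rule on the coordinates of one pairwise alignment. The sketch below is not scaffAssembler itself: the `Hit` record, the `slack` tolerance, and the function name are hypothetical, strand is ignored, and no check is made that the two extremities form compatible ends. Only the 10 kb and 99% thresholds come from the text.

```python
from dataclasses import dataclass

MIN_LENGTH = 10_000    # from the text: at least 10 kb of contiguous sequence
MIN_IDENTITY = 99.0    # from the text: at least 99% nucleotide identity


@dataclass
class Hit:
    """One pairwise alignment between scaffolds A and B (hypothetical record)."""
    a_len: int
    a_start: int      # 0-based, inclusive
    a_end: int        # exclusive
    b_len: int
    b_start: int
    b_end: int
    identity: float   # percent identity


def classify(hit, slack=0):
    """Classify a hit as identical, included, overlapping, or not redundant.

    slack is an assumed tolerance (in bp) for reaching a scaffold end.
    """
    aligned = min(hit.a_end - hit.a_start, hit.b_end - hit.b_start)
    if aligned < MIN_LENGTH or hit.identity < MIN_IDENTITY:
        return "not redundant"
    a_full = hit.a_start <= slack and hit.a_end >= hit.a_len - slack
    b_full = hit.b_start <= slack and hit.b_end >= hit.b_len - slack
    if a_full and b_full:
        return "identical"
    if a_full or b_full:
        return "included"
    # The alignment must touch an extremity of each scaffold.
    a_edge = hit.a_start <= slack or hit.a_end >= hit.a_len - slack
    b_edge = hit.b_start <= slack or hit.b_end >= hit.b_len - slack
    if a_edge and b_edge:
        return "overlapping"
    return "not redundant"


# Toy example: the last 15 kb of A matches the first 15 kb of B.
print(classify(Hit(100_000, 85_000, 100_000, 80_000, 0, 15_000, 99.2)))
```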

Two types of sequence redundancies were distinguished and treated separately: “expected”

versus “unexpected” based on the physical map information, i.e. the overlaps between BACs

predicted through comparisons of their fingerprints. First, the expected redundancy

corresponds to every pair of known overlapping BACs that are split into two different

sequencing pools. In total, 521 out of the 922 initial BAC-pools (57%) contain overlapping

MTP BACs. For the expected redundancy, scaffold-based assembly was applied directly with

scaffAssembler on targeted BAC-pools. This decreased the number of scaffolds

from 4,999 to 4,747 and removed 11.7 Mb (1.2%). However, the major part of the redundancy

was in fact unexpected from the physical map. It comes from four different sources: 1) some

BAC-contigs overlapped but were assembled into two separate contigs because the local

coverage (i.e. number of BACs covering a locus) was too low for joining with FPC (64) (such

cases were expected, especially because the physical map of chromosome 3B was assembled

at high stringency (1e-45) to prevent chimeric BAC-contigs); 2) some BAC-contigs

may be redundant but were mistakenly assembled into separate contigs by FPC; 3) BACs

may be misassembled by FPC (fingerprint assembly errors) i.e. they do not belong to the

predicted BAC-contig; and 4) some wells might be contaminated with other clones. In the

latter two cases, scaffolds originating from misassembled or contaminated BACs are fully

redundant because they carry a genomic locus sequenced in another BAC-pool.

In contrast to the expected redundancy, solving the unexpected redundancy required an

all-by-all comparison of the full set of scaffolds. As ca. 85% of the sequence is made of repeated

elements, all-by-all alignment of 1 Gb of sequence is computationally very intensive.

We therefore developed a two-step strategy in which we first used the assignment of

BAC-contigs to 8 deletion bins in order to divide the chromosome into 8 fractions. All-by-all

comparisons limited to scaffolds belonging to the same deletion bin were performed using

iterative runs of scaffAssembler. This decreased the number of scaffolds to 3,769

and resulted in discarding 59.7 Mb (6%) of redundant sequences. Then, we developed a

strategy based on determining shared TE-junctions between scaffolds to identify potentially

redundant sequences. Indeed, junctions between nested transposons are extremely abundant

and mainly unique in the genome and thus are frequently used as specific molecular markers

(Insertion Site Based Polymorphism, ISBP (50)). They can be used in silico as a signature to

identify redundancies, i.e. scaffolds sharing the same set of ISBPs. In total, 166,385 ISBPs

were predicted along the 3,769 scaffolds with isbpFinder (50), and pairs of scaffolds sharing

identical junctions were detected. Hence, instead of aligning the full dataset, we focused on

38 Mb of TE junctions. With this approach, we identified 2,136 scaffolds clustered into 349

groups of potentially overlapping scaffolds. Their assembly decreased the number of

scaffolds to 2,827 and discarded 88.0 Mb (9.6%) of redundancy. Since a 10 kb length threshold was

applied to define redundant regions, it was not possible to discard small redundant

scaffolds. However, 19 small scaffolds that carry a redundant copy of a gene predicted on

another scaffold (>99% identity over the full gene length + 500 surrounding bp) were

identified and discarded. This led to a final number of 2,808 scaffolds representing 833 Mb

(Table S1).
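
Grouping scaffolds that share identical TE junctions amounts to finding connected components in a graph where scaffolds are linked by a common junction. The following Python sketch shows one way to do this with a union-find structure. It is an illustration, not the authors' script: representing each junction as a hashable signature (for example its sequence) and the scaffold names are assumptions.

```python
from collections import defaultdict


def cluster_by_shared_junctions(junctions_by_scaffold):
    """Group scaffolds that share at least one junction signature.

    junctions_by_scaffold: dict mapping a scaffold name to a set of
    junction signatures (hypothetical representation).
    Returns a sorted list of groups (each a sorted list of names),
    keeping only groups with two or more scaffolds.
    """
    parent = {name: name for name in junctions_by_scaffold}

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # junction signature -> first scaffold carrying it
    for scaffold, junctions in junctions_by_scaffold.items():
        for junction in junctions:
            if junction in first_seen:
                union(first_seen[junction], scaffold)
            else:
                first_seen[junction] = scaffold

    groups = defaultdict(set)
    for scaffold in junctions_by_scaffold:
        groups[find(scaffold)].add(scaffold)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy example with made-up scaffold names and junction identifiers.
toy = {
    "s1": {"j1", "j2"},
    "s2": {"j2", "j3"},
    "s3": {"j3"},
    "s4": {"j9"},
    "s5": {"j10"},
    "s6": {"j10"},
}
print(cluster_by_shared_junctions(toy))
```

Each resulting group would then be a candidate set for a targeted all-by-all alignment, which is far smaller than the full dataset.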

1.6. Construction of pseudomolecule from chromosome 3B

1.6.1. SNP marker development via SureSelect Target Enrichment

Genomic DNA was extracted from 10 wheat accessions (Chinese Spring, Renan, Preimio,

Robigus, Xi19, Apache, Aztec, Autan, Cezanne and Uli3) following the protocol described by

Graner et al. (65). Sequence capture was performed on these 10 lines using the SureSelect

Target Enrichment System for Illumina Paired-end Sequencing Library version 1.0 May 2010

(Agilent, Santa Clara, CA) following the manufacturer’s procedure. Hybridization was

performed with a SureSelect Target Enrichment library containing 120-nucleotide baits corresponding

to 52,265 “low copy DNA – TE” high confidence ISBP (Insertion Site-Based Polymorphism)

markers. Following hybridization, captured DNA was amplified to add either index tag #6 or

#12. Equimolar pools of two samples carrying different tags were constructed and sequenced

on an Illumina HiSeq2000 instrument in 2x100 bp paired-end reads. Illumina reads were

mapped against the ISBP reference sequences using the Mosaik package (The MarthLab;

http://bioinformatics.bc.edu/marthlab/wiki/index.php/Software). For MosaikAligner, the

following parameters were used: -mm 3 -act 60 -minp 0.6 -mmal. SNP calling was done with

GigaBayes (The MarthLab; http://bioinformatics.bc.edu/marthlab/wiki/index.php/Software)

using the following parameters: --ploidy diploid --sample multiple --PSL 1 --O 4 --CRL 20.

Finally, GigaBayes output was processed using a Perl script (snpRanking; available upon

request) to classify the SNPs into four different classes according to the detection of

homozygous/heterozygous individuals in the selected panel. Class 1: only homozygous AA

and BB alleles were detected. Class 2: homozygous AA and BB, and heterozygous AB lines

were detected. Class 3: only homozygous AA and heterozygous AB lines were detected. Class

4: only heterozygous AB lines were detected. A genotype was called heterozygous when

at least 10% of the reads supported the presence of a heterozygous locus. In total, 49,836 SNPs

were discovered including 33,220, 5,857, 8,858 and 1,901 SNPs from Class 1 to 4,

respectively. Only SNPs from Class 1 and 2 were selected for further analyses, representing

39,077 markers. SNP context sequences were then mapped back to 3B sequence scaffolds.

These data were then used to select a subset of evenly distributed SNPs according to the

following criteria: for scaffolds between 80 and 200 kb, a single SNP was selected; for

scaffolds larger than 200 kb, we selected 1 SNP every 200 kb so that they were evenly

distributed along the scaffold. Priority was given to Class 1 SNPs that were polymorphic between

Chinese Spring and Renan for mapping purposes. This resulted in a set of 3,735 SNPs (3,390

Class 1 and 345 Class 2) that were submitted to KBioscience (Hoddesdon, UK) for KASPar

assay design. These assays were used to genotype a set of wheat accessions (see section 6).

Eventually, 3,075 SNPs yielded usable genotyping results.
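
The four SNP classes defined in this section depend only on which genotypes are observed across the panel. The sketch below is an illustration of that rule, not the snpRanking script: the genotype labels, the function name, and the decision to treat either homozygous type together with AB as Class 3 (allele labels being arbitrary) are assumptions. The class counts are those reported in the text.

```python
def snp_class(genotypes):
    """Return the SNP class (1-4) for a panel of genotype calls.

    genotypes: iterable of 'AA', 'BB', or 'AB' calls, one per line.
    Returns None when the site is monomorphic (a single homozygous type).
    """
    observed = set(genotypes)
    has_aa = "AA" in observed
    has_bb = "BB" in observed
    has_ab = "AB" in observed
    if has_aa and has_bb:
        return 2 if has_ab else 1   # both homozygous types present
    if has_ab and (has_aa or has_bb):
        return 3                    # one homozygous type plus heterozygotes
    if has_ab:
        return 4                    # heterozygotes only
    return None


# Counts per class as reported in the text.
reported = {1: 33_220, 2: 5_857, 3: 8_858, 4: 1_901}
print(sum(reported.values()))        # total number of SNPs
print(reported[1] + reported[2])     # Class 1 and 2 SNPs kept for analysis
print(snp_class(["AA", "BB", "AA", "AB"]))
```

The two sums reproduce the totals given in the text (49,836 SNPs overall and 39,077 in Classes 1 and 2).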

1.6.2. Ordering and orientating scaffolds along the chromosome

Scaffolds were ordered and orientated according to their marker content. Molecular markers

were ordered along a genetic map that was refined using LD data (see section 6). Markers

were mapped on scaffold sequences by similarity search of the context sequence using

BLAST (99% identity or more over the full length of the context sequence): 2,594 anchor

SNPs were assigned to 1,006 scaffolds. In addition, 1,180 other markers for which a

sequence was available (primers or context sequence) and originating from the consensus map

(see section 6) were also located on the scaffolds. However, a higher weight was given to the

anchor SNPs compared to markers from the consensus map: when both types were present on

a scaffold, only anchor markers were considered. When a scaffold carried 2 or more markers,

the minimum position (i.e. closest to the telomere of the short arm) was considered as the

position of the scaffold along the chromosome. In addition, when those markers had

different mapping positions, we were able to orientate the scaffold along the

telomere-centromere axis. Finally, relative positions of BESs were used to order scaffolds that belong to

the same physical contig (because BACs are ordered within each contig). This information

was then used to infer a position for 270 scaffolds without marker information but for which a

neighbor scaffold was genetically anchored. Altogether, we assigned a position to 1,358

scaffolds representing 774 Mb (93%; N50=949 kb), which make up the 3B pseudomolecule. Among

them, 489 scaffolds (52% of the size of the pseudomolecule) were orientated. Finally,

1,450 scaffolds remained unplaced along the pseudomolecule. They account for only 7% of

the sequence and 6% of the predicted genes.
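
The placement rule described above (minimum map position as the scaffold position, orientation when markers map to different positions) can be sketched as follows. This is a simplified illustration under stated assumptions: marker records are (scaffold coordinate, map position) pairs, orientation is inferred from the two markers with the lowest and highest map positions only, and the "+"/"-" labels and function names are hypothetical. The BES-based ordering within physical contigs is not modelled.

```python
def place_scaffold(markers):
    """Return (position, orientation) for one scaffold.

    markers: list of (scaffold_coordinate_bp, map_position) tuples.
    position is the minimum map position; orientation is '+' when map
    position increases with scaffold coordinate, '-' when it decreases,
    and None when it cannot be determined.
    """
    if not markers:
        return None, None
    position = min(map_pos for _, map_pos in markers)
    lowest = min(markers, key=lambda m: m[1])
    highest = max(markers, key=lambda m: m[1])
    if lowest[1] == highest[1] or lowest[0] == highest[0]:
        orientation = None
    elif lowest[0] < highest[0]:
        orientation = "+"
    else:
        orientation = "-"
    return position, orientation


def order_scaffolds(marker_table):
    """Order anchored scaffolds by their inferred position."""
    placed = {name: place_scaffold(m) for name, m in marker_table.items()}
    placed = {n: p for n, p in placed.items() if p[0] is not None}
    return sorted(placed.items(), key=lambda item: item[1][0])


# Toy example with made-up scaffold names, coordinates, and map positions.
toy_table = {
    "scafA": [(1_000, 12.5), (90_000, 14.0)],
    "scafB": [(5_000, 3.2)],
    "scafC": [(2_000, 8.0), (70_000, 6.5)],
}
for name, (position, orientation) in order_scaffolds(toy_table):
    print(name, position, orientation)
```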

2. Sequence annotation

2.1. Gene annotation using an improved version of the TriAnnot pipeline

2.1.1. Training ab-initio predictors

Ab-initio gene predictors Augustus (66) and GeneID (67) were previously trained with a

limited sample of genes (2). Here, we took advantage of having access for the first time to

thousands of wheat genes to improve the accuracy of the predictors. Thus, we first performed

automated gene modeling with a previous, untrained release of TriAnnot over the full wheat

3B sequence, which led to 9,233 predictions. Of these, 6,475 were manually checked under

Artemis (68) and their structure was corrected as needed. Among them, 3,273 coding

sequences (CDSs; average length: 1,219 nucleotides; average number of exons: 4.5) had a

structure automatically validated, i.e., each structural feature was supported by biological

evidence (see below). Those were used for training ab-initio predictors of TriAnnot (53)

which significantly improved the specificity of Augustus and both the sensitivity and

specificity of GeneID. In addition, a wheat specific matrix was computed for the Eugene

combiner (69) with our gene sample. A second run of gene modeling was then launched with

the newly trained predictors while mapping manually curated genes back to the chromosome.

2.1.2. Gene modeling and definition of gene islands

Several improvements were made to the TriAnnot pipeline (53) to increase the accuracy and

validate gene models. First, by combining evidence from different sources, we improved the

module responsible for selecting the best gene model at a given locus. Selection is performed

through a scoring method that estimates the accuracy of the CDS structure, based on checking

the reliability of the positions of the start codon, the stop codon, and the splicing sites.

TriAnnot checks whether those features correspond to biological data according to spliced

alignments of transcripts and proteins over the chromosome using Exonerate (70). Using

transcript sequences (from RNASeq data generated in this study, and Triticeae ESTs/mRNAs

publicly available), the module can validate, or not, the predicted splicing sites in the CDS

model. For the validation of the predicted start and stop codons, the scoring module considers

the similarity with homologous proteins in related species. Taking into account that

extremities are variable in length and sequence between orthologous proteins, we defined a

range of ten amino acids to consider that the predicted start and stop codons correspond to

that of a protein already identified in another species. This automated procedure to support

gene modeling was used to assign a confidence index: “High Confidence” when all features

are supported by biological evidence, or “Low Confidence” when one or more features are not

supported. The score attributed to a prediction is the sum of the percentage of supported

features, the percentage of amino-acid identity and percentage of overlap with the best

BLAST hit in related plant proteomes. The prediction with the highest score at a given locus was

kept in the released annotation. Finally, ab-initio predictions that did not share any significant

similarity with proteins annotated in plant genomes or that lacked transcription evidence were

considered false positives and were therefore discarded from the annotation.
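
The scoring and confidence rules described above can be written down compactly. The sketch below follows the description in the text (score = percentage of supported features + percentage of amino-acid identity + percentage of overlap with the best BLAST hit) but is not TriAnnot code: the `GeneModel` record, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """A candidate gene model at a locus (hypothetical record)."""
    name: str
    n_features: int      # start codon, stop codon, and splicing sites
    n_supported: int     # features supported by biological evidence
    pct_identity: float  # amino-acid identity with the best BLAST hit
    pct_overlap: float   # overlap with the best BLAST hit

    @property
    def confidence(self):
        # High Confidence only when every feature is supported.
        if self.n_supported == self.n_features:
            return "High Confidence"
        return "Low Confidence"

    @property
    def score(self):
        pct_supported = 100.0 * self.n_supported / self.n_features
        return pct_supported + self.pct_identity + self.pct_overlap


def best_model(models):
    """Keep the prediction with the highest score at a locus."""
    return max(models, key=lambda m: m.score)


# Toy example with two competing models at one locus (made-up values).
m1 = GeneModel("model_1", n_features=10, n_supported=10,
               pct_identity=80.0, pct_overlap=90.0)
m2 = GeneModel("model_2", n_features=10, n_supported=8,
               pct_identity=95.0, pct_overlap=99.0)
best = best_model([m1, m2])
print(best.name, best.score, best.confidence)
```

As the toy case shows, under this additive score a model with unsupported features can still outrank a fully supported one if its similarity to the best hit is higher.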

Considering the 7,264 genes and pseudogenes annotated along the 3B pseudomolecule, the

median intergenic distance of 30 kb was used as a threshold to define genes that are

clustered into islands.
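
A distance threshold of this kind can be applied with a single pass over genes sorted by position. The sketch below is an illustration rather than the authors' implementation: only the 30 kb threshold comes from the text, while the requirement of at least two genes per island (`min_genes`) and the coordinate representation are assumptions.

```python
def gene_islands(genes, max_gap=30_000, min_genes=2):
    """Group genes separated by at most max_gap bp into islands.

    genes: list of (start, end) coordinates on one sequence.
    max_gap: intergenic distance threshold (30 kb in the text).
    min_genes: assumed minimum number of genes for a group to count
               as an island.
    """
    islands, current = [], []
    prev_end = None
    for start, end in sorted(genes):
        if prev_end is not None and start - prev_end > max_gap:
            # Gap too large: close the current group.
            if len(current) >= min_genes:
                islands.append(current)
            current = []
        current.append((start, end))
        prev_end = end if prev_end is None else max(prev_end, end)
    if len(current) >= min_genes:
        islands.append(current)
    return islands


# Toy example with made-up gene coordinates.
toy_genes = [(1_000, 3_000), (10_000, 12_000), (60_000, 62_000),
             (200_000, 201_000), (215_000, 216_000)]
print(gene_islands(toy_genes))
```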

2.1.3. Classification of genes, pseudogenes, and gene fragments

Each predicted protein sequence was then analyzed in order to determine if the coding

sequence is likely nonfunctional due to mutation or truncation (pseudogene or gene fragment).

Gene models displaying internal stop codons, frame shift mutations, or deletions (leaving

between 50% and 70% of the length of a complete homolog) within the CDS were considered

as pseudogenes. Genes showing similarity over less than 50% of the length of their best

homolog in the plant protein databank were considered as gene fragments.
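
These rules can be summarized as a small decision function. The sketch below is a reading of the criteria above, not the pipeline's code: the order in which the tests are applied and the handling of the exact 50% and 70% boundaries are assumptions, since the text does not specify them.

```python
def classify_model(coverage, has_internal_stop=False, has_frameshift=False):
    """Classify a gene model as 'gene', 'pseudogene', or 'gene fragment'.

    coverage: fraction (0 to 1) of the best homolog's length covered
              by the model.
    Assumptions: the fragment test is applied first, and a coverage of
    exactly 0.5 or 0.7 falls in the pseudogene range.
    """
    if coverage < 0.5:
        return "gene fragment"
    if has_internal_stop or has_frameshift or coverage <= 0.7:
        return "pseudogene"
    return "gene"


# Toy examples with made-up values.
print(classify_model(0.40))
print(classify_model(0.60))
print(classify_model(0.95, has_internal_stop=True))
print(classify_model(0.95))
```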

2.1.4. Content in protein-coding genes and estimation of the prediction accuracy

In total, 7,703 protein-coding genes were predicted from the 2,808 scaffolds of the

chromosome 3B sequence including 7,264 on the pseudomolecule. Manual curation was

performed for 48% of them (3,711/7,703) by checking and correcting the accuracy of gene

modelling with respect to similarity with transcripts and homologous proteins. The automated

procedure for validation of the CDS coordinates revealed that 59% (4,571) are of “High

Confidence”, meaning that biological evidence supports the positions of the start codon, stop codon,

and all splicing sites. To identify potential missing gene models in the annotation, we

compared our annotation (using BLASTN) with the gene models predicted with the MIPS

annotation pipeline from assemblies of whole Chromosome Survey Sequences (CSS) (19).

We focused only on high confidence predictions, i.e. HCS1 to 3 models in the CSS

annotation. Genes were considered missing from our annotation if there was no hit with at

least 99% identity covering more than 50% of the BLAST query or hit length. This resulted in

a list of 1,651 potentially missing genes. Out of these, 839 were found on the chromosome 3B

scaffolds (at least 99% identity for at least 90% of the gene length) and, thus, represent

potential errors of the TriAnnot pipeline annotation. To detect those for which biological

evidence is available, we filtered this dataset by searching for similarity with the

Brachypodium genes (at least 35% identity and 70% overlap) and RNA-Seq-derived

transcripts produced in this study (99% identity and 90% overlap). This left only 226

potentially missing predictions. These were manually inspected to understand why TriAnnot

did not identify them. They could be explained as follows: 1) Part of the gene model was

masked because of similarity with our TE-library; 2) The gene model is a pseudogene; or 3)

The gene model was in fact predicted by TriAnnot, but with a very different structure (the

50% overlap threshold was too stringent). Finally, only 25 gene models predicted by the

MIPS pipeline on the chromosome 3B CSS assemblies and not found in the 3B

pseudomolecule annotation correspond to genes likely functional, showing transcription

evidence and similar to a Brachypodium gene, that were not found by TriAnnot, representing

0.3% of the total predicted gene number.

2.1.5. Using gene models to estimate the quality of wheat whole genome and

chromosome-based shotgun assemblies

The 7,264 gene models of the 3B pseudomolecule were mapped on the sequence assemblies

from the same genotype obtained through genome-wide or chromosome-wide sequencing

approaches. When compared to the 949,279 contigs representing a 5x coverage whole

genome shotgun sequence produced by (8), 79% of the 3B genes aligned to 29,498 contigs

(average size = 499 bp). Overall, 27% of these contigs were assigned to the B-genome and

none of them were anchored to a specific chromosome. When compared with the Illumina

shotgun assembly of the 3B chromosome generated by the IWGSC (19), 95% of the 3B genes

were found on 8,059 contigs (average size 6.8 kb) of which 57% were virtually ordered using

synteny.

2.1.6. GO term annotation and enrichment analysis

Genes were classified in the 3 main GO categories: molecular function, biological process and

cellular component. Out of the 7,264 genes predicted on the pseudomolecule, 5,128 (71%)

were associated with at least 1 GO term, representing 1,567 unique GO terms. To determine

gene ontology enrichment, similarity search using BLASTP (E-value < 1e-05) was performed

for each predicted gene product against the PLAZA 2.5 protein database (71). Based on the

functional information of the homologs (GO or InterPro [IPR]), consensus functional

information was then transferred to the 3B protein candidate. Functional terms from the 5 best

homologs with majority coverage (i.e., >50%) were considered for the analysis. Then, GOBU

(Gene Ontology Browsing Utility (72)) was used for enrichment calculations. The full set of

3B gene products annotated on the pseudomolecule was used as the reference comparison set

for the enrichment analysis in the R1, R2, and R3 regions. P-values were calculated under

GOBU with the Multiview Plugin using Fisher’s exact test, and were then adjusted with the

Benjamini and Hochberg method using the R function “p.adjust” (correction for

multiple testing). Finally, redundancy in the list of enriched GO terms was removed

using the program GO Trimming (73) using default parameters.
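
As an illustration of the multiple-testing correction step, the Benjamini and Hochberg adjustment can be written in a few lines of Python. This is a sketch of the standard procedure, intended to mirror what the R function p.adjust computes with method "BH"; it is not the code used in the study, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    # Visit p-values from the largest to the smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # 1-based rank in ascending order
        # Scale by n / rank and enforce monotonicity with a running minimum.
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

An enriched GO term would then be retained when its adjusted p-value falls below the chosen significance level.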

2.1.7. Identification of genes putatively involved in resistance to pathogens

Most resistance genes against fungal pathogens identified in plants are from the NBS-LRR

family (nucleotide-binding leucine-rich repeat). Thus, in order to identify wheat gene products

that are putatively related to resistance against pathogens, we used PFAM (Release 27)

domains PF00931 (NB-ARC) (74). We also searched for similarity against domain PF03018

(Dirigent) that represents a family of proteins that are induced during disease response in

plants. In addition, we added all gene products whose best BLAST hit in the rice proteome

was annotated as a disease resistance protein. In total, 171 putative resistance genes were

identified on chromosome 3B. Regions R1, R2, and R3 carry 43, 36, and 92 of these genes,

respectively. In addition, 68 are syntenic genes while 103 are nonsyntenic.

2.1.8. Non-coding RNA gene predictions

Non-coding RNA genes are usually not or poorly annotated in genome sequencing projects.

Among the few annotated ncRNA genes in GenBank, rRNA and tRNA genes are the most

common ones. In plants, small nuclear RNAs (snRNAs) and small nucleolar RNAs

(snoRNAs) are also well-studied ncRNA families. snoRNAs are a class of ncRNAs that

primarily guide chemical modifications of other RNAs, mainly rRNAs and snRNAs in

eukaryotes. There are two main classes of snoRNAs, the C/D box snoRNAs that are

associated with 2'-O methylation, and the H/ACA box snoRNAs that are associated with

pseudouridylation. snRNAs, also commonly called U-RNAs, are involved in the processing of

pre-mRNAs.

We predicted rRNA genes by using RNAmmer (75) and Rfamscan (v1.0,

ftp://ftp.sanger.ac.uk/pub/databases/Rfam/tools/). RNAmmer was used with both 'euk' and

'bac' parameters to find nuclear rRNA genes and those that could originate from the

mitochondrial and/or chloroplastic genomes. In total, 89 5S rRNA genes were retained in the

annotation. They correspond to Rfamscan predictions that share similarity (searched with

BLAST) with known wheat 5S rRNA sequence (X06094) and half of them are organized in

tandem repeats. Regions of similarity with chloroplast and mitochondrial rRNAs were also

found. Those that also showed Rfamscan motifs were retained in the final annotation. In addition,

622 tRNA genes were predicted on 3B by tRNAscan-SE (76).

Rfamscan initially predicted the presence of 42 small nuclear RNA (snRNA) genes. Curation

resulted in keeping 22 of them that contain USE and TATA boxes (77). Remarkably, the

number of U1, U2, and U6 candidates in this chromosome is comparable to the total number

of known predictions in the genome of Arabidopsis thaliana. Around 1,250 small nucleolar

RNA (snoRNA) gene candidates were initially predicted by using Rfamscan. Among them,

1,121 were homologous to snoRNA71 and were not considered in this annotation. Manual

curation of all other candidates led to the validation of 92 potential snoRNA genes, of which 70 are

organized in 16 clusters.

2.2. TE annotation

2.2.1. Classification of a library of Triticeae transposable element sequences for

similarity-based annotation

TE annotation was performed using a library of 4,929 sequences: 1,543 Gypsy, 797 Mariner,

764 Copia, 554 CACTA, 438 unclassified repeats, 234 LINEs, 175 unclassified DNA

transposons with Terminal Inverted Repeats (TIRs), 168 unclassified Long Terminal Repeats

(LTRs) retroelements, 127 Mutator, 80 Harbinger, 18 unclassified DNA transposons, 16

Helitron, 11 hAT, and 4 SINEs. This data set originated from two sources: the TREP library

(http://wheat.pw.usda.gov/ITMI/Repeats/), and a previous curated annotation of TEs found in

18 Mb from chromosome 3B (2). Based on the full-length TE sequences present in the library

(i.e. elements having TIRs, LTRs and/or features typical of complete LINEs/SINEs), a first

clustering step was applied for each superfamily. For the class II transposons, Miniature

Inverted-repeat Transposable Elements (MITEs) were considered independently from their

complete parent TE copies in order to create specific clusters for these highly repeated, short,

non-autonomous elements. An all-by-all BLAST alignment was first performed and used to

cluster sequences with MCL (78) (using option -I 1.2). Multiple sequence alignments were

then computed with MAFFT (79) for all clusters comprising 3 or more copies. Jalview was

then used to manually inspect every multiple alignment and its related neighbor-joining

tree. For clusters comprising a large number of sequences and when several monophyletic

groups were clearly separated, each subgroup was defined as a sub-variant of the family. For

LTR-retrotransposons, the borders of the two LTRs were searched using the TRsearch

program included in REPET (80) for each element. These positions were required for ClariTE

(see below) for the automated curation of the predictions.

2.2.2. Development of ClariTE for high quality automated annotation of TEs

The 2,808 scaffolds composing the 3B chromosome were investigated for TE content using

RepeatMasker (cross_match engine with default parameters; http://www.repeatmasker.org/)

with our curated TE library. Since RepeatMasker does not reconstruct the nested TE patterns

and gives overlapping predictions (one locus can share similarity with several TEs in the

library), we developed the ClariTE program (available upon request) to correct the raw

similarity search results and, consequently, provide high quality TE annotation for

downstream structural and evolutionary analyses. ClariTE is based on our TE library

classification and format (see above). ClariTE performs the three following steps:

a. Resolution of overlapping predictions. To solve the overlap between two

predictions, priority was given to keeping the prediction that covers an extremity of a

TE. If neither or both of the predictions cover a TE extremity, priority was given to

keeping the longest prediction and recalculating positions of the other one.

b. Merging predictions. This step is essential to resolve the over-fragmentation of the

predictions. Fragmentation is due, firstly, to the presence of gaps in the scaffolds, and,

secondly, to the fact that a newly identified TE copy may diverge from the reference

element so that one element is not predicted as a single piece but is rather split into

several pieces matching different parts of elements from the same family. In that case,

all neighbor pieces related to the same family were merged into a single feature if the

collinearity of the matching segments was respected, except for LTR matching

segments. Indeed, since LTR positions of reference TEs are known and annotated in

our library, this information was considered during the merging process.

c. Reconstruction of nested TEs. We developed a procedure to join separated features

that are part of the same TE and have been scattered by more recent insertions (i.e.

shaping nested clusters). Joining was allowed when 2 segments matching the same

family (respecting the collinearity between the prediction and the reference TE)

are separated by a maximum of 10 predicted TEs. The final stage of the annotation is

the classification of intact full-length TEs versus fragmented TEs. Intact full-length

TEs are predictions covering at least 90% of the reference complete TE in the library

and for which both extremities were identified (in a range of 50 nucleotides).
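
Two of the rules above (the priority rule of step a and the final intact versus fragmented classification) can be sketched as follows. This is a hypothetical Python illustration and not the ClariTE source code: the dictionary keys, the function names, the tie-breaking toward the first prediction, and the reading of "in a range of 50 nucleotides" as a tolerance at each end of the reference element are assumptions.

```python
def pick_priority(pred_a, pred_b):
    """Choose which of two overlapping predictions keeps its coordinates.

    Each prediction is a dict with 'start', 'end' (1-based, inclusive) and
    'covers_extremity' (True if it reaches an end of the reference TE).
    """
    # A prediction covering a TE extremity wins over one that does not.
    if pred_a["covers_extremity"] != pred_b["covers_extremity"]:
        return pred_a if pred_a["covers_extremity"] else pred_b
    # Otherwise (neither or both cover an extremity) the longest wins.
    len_a = pred_a["end"] - pred_a["start"] + 1
    len_b = pred_b["end"] - pred_b["start"] + 1
    return pred_a if len_a >= len_b else pred_b


def is_intact(match_start, match_end, ref_length, min_cov=0.90, edge_tol=50):
    """Test whether a match on the reference TE qualifies as intact.

    match_start, match_end: 1-based inclusive coordinates on the reference.
    """
    covered = match_end - match_start + 1
    both_ends = match_start <= 1 + edge_tol and match_end >= ref_length - edge_tol
    return covered / ref_length >= min_cov and both_ends
```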

2.2.3. Solo LTR annotation

Based on the 30,406 intact LTR-RTs annotated on chromosome 3B, we built a library by

extracting 18,928 LTR sequences that started with TG and ended with CA dinucleotides, the

common motifs found at the border of LTRs in wheat. This library was then used for

similarity search with RepeatMasker. In addition, to distinguish solo LTRs from truncated

LTR-RTs, we searched for the presence of a 5-bp Target Site Duplication (TSD; one

nucleotide variation tolerated) flanking the matching region. In total, we detected 3,998 solo

LTRs with TSD.
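
The two sequence criteria used here (TG...CA borders and a 5-bp TSD with one mismatch tolerated) are simple enough to sketch in Python. The functions below are illustrative only; their names and the convention that the flanks are passed as separate strings are assumptions.

```python
def has_ltr_borders(ltr_seq):
    """True if the sequence starts with TG and ends with CA."""
    seq = ltr_seq.upper()
    return seq.startswith("TG") and seq.endswith("CA")


def has_tsd(left_flank, right_flank, tsd_len=5, max_mismatch=1):
    """Compare the bases immediately flanking a candidate solo LTR.

    left_flank ends just before the matching region and right_flank starts
    just after it. At most max_mismatch differences are tolerated.
    """
    left = left_flank[-tsd_len:].upper()
    right = right_flank[:tsd_len].upper()
    if len(left) < tsd_len or len(right) < tsd_len:
        return False
    mismatches = sum(a != b for a, b in zip(left, right))
    return mismatches <= max_mismatch
```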

2.2.4. De novo repeat identification

De novo annotation was performed with the REPET package V2.0 that combines the

TEdenovo and TEannot pipelines (80). REPET runs on sequence contigs rather than on

scaffolds and therefore, we ran it on the 294,691 contigs of assembly version 2.1 (Table S1)

representing 986.1 Mbp. The known TE library was first used to focus the de novo detection

on unknown repeats: every sequence sharing more than 80% identity with a known consensus

TE sequence was excised from the initial sequence. At that stage, only contigs larger than

5 kb were considered. Repeats corresponding to microsatellites and repeated genes were

filtered out. Finally, a consensus library of 7,009 elements was built and we kept 1,573

consensus sequences of clusters for which at least one full-length copy was found. The library

was then used for similarity search on the 3B scaffolds previously masked with known TEs.

This allowed us to assign 3.6% of the sequence to a family identified de novo.

2.3. Distribution of TEs

2.3.1. Definition of the centromere

A putative location of the centromere was estimated by plotting along the pseudomolecule the

density of Cereba (called CRW) and Quinta LTR-RTs that are known to be associated with

the active centromere (22). The percentage of CRW and Quinta was estimated in sliding

windows of 10 Mb with a step of 1 Mb. The average percentage of CRW and Quinta elements

along the 3B pseudomolecule is 0.4% ranging from 0.0% to 5.5% (per 10 Mb). Two major

peaks corresponding to regions in which the proportion of the two elements is higher than

1.0% were observed (Fig. S2). The first peak covers 7 Mb (265-272 Mb) with an average of

2.6% of the two elements while the second larger peak covers 68 Mb (319-387 Mb) and

contains on average 3.1% of CRW and Quinta. We then examined the conservation between the 179

genes located in the centromeric region of wheat chromosome 3B and the 22 genes located on

rice chromosome 1 between positions 16.7 and 18.5 Mb, which correspond to the region with

complete crossover suppression (23). Fourteen matches were identified indicating that a

majority of the rice genes are found in the 3B putative centromeric region. Combined with

data on recombination rate and linkage disequilibrium (Fig. S2), we considered the 122 Mb

region from position 265 Mb to position 387 Mb as the centromeric/pericentromeric region.
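
The sliding-window density used above (10 Mb windows, 1 Mb step) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: features are given as non-overlapping half-open (start, end) intervals, only complete windows are reported, and the function name is hypothetical.

```python
def sliding_window_percent(features, chrom_length, window=10_000_000, step=1_000_000):
    """Percentage of each window covered by the given features.

    features: list of non-overlapping half-open (start, end) intervals in bp.
    Returns a list of (window_start, percent_covered) tuples.
    """
    results = []
    start = 0
    while start + window <= chrom_length:
        end = start + window
        # Sum the part of every feature that falls inside the window.
        covered = sum(max(0, min(e, end) - max(s, start)) for s, e in features)
        results.append((start, 100.0 * covered / window))
        start += step
    return results
```

Windows in which the resulting percentage exceeds a chosen cutoff (1.0% for CRW and Quinta in the text) can then be merged into candidate regions.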

2.3.2. Pattern of TE distribution

The TE distribution along chromosome 3B was analyzed in a sliding window of 10 Mb

(step=1 Mb). Segmentation analysis of the global TE content divided the 3B chromosome into

5 regions (Fig. 1C). Boundaries of both telomere regions (0-63 Mb, 700-769 Mb) are very

close to those of regions R1 and R3. In those distal regions, TE proportion is the lowest with

73% and 68%, respectively (Table 2). Segmentation revealed that the central chromosomal

region could be divided into 3 regions according to their TE content: 64-262 Mb, 263-384 Mb

and 385-699 Mb. The centromere corresponds to the region with the highest percentage of

TEs (93%), while both internal parts of the 3BS and 3BL arms show an intermediate and

equivalent level of TEs: 88%.

TE distribution was analyzed in more detail by distinguishing the different superfamilies

(gypsy, CACTA, copia; Fig. S6). Their distribution appeared specific to each superfamily and

we observed three completely different patterns along the chromosome. The decrease of TEs

in the distal regions results essentially from a decrease in gypsy content. In contrast, the

amount of CACTA transposons increases towards the distal regions R1 and R3. Finally, the

distribution of the amount of copia is stable along the chromosome at the 10 Mb scale. The

diversity of the types of TEs was also investigated along the chromosome by plotting the

number of families representing 99% of the TE fraction (N99) within a sliding window of

10 Mb. Results showed that the TE diversity is significantly higher in the distal regions than

in the central region.
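
One plausible reading of the N99 index is the smallest number of families, taken from the most to the least abundant, whose cumulative length reaches 99% of the TE fraction in a window. The Python sketch below implements that reading; the function name and the input format (a mapping from family name to covered bp) are assumptions.

```python
def n_fraction(family_bp, fraction=0.99):
    """Number of most abundant families needed to reach the given fraction.

    family_bp: dict mapping TE family name to bp covered in one window.
    """
    total = sum(family_bp.values())
    if total == 0:
        return 0
    cumulative = 0
    sizes = sorted(family_bp.values(), reverse=True)
    for count, size in enumerate(sizes, start=1):
        cumulative += size
        if cumulative >= fraction * total:
            return count
    return len(sizes)
```

A window dominated by a few large families gives a low value, whereas a window with many small families gives a high one.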

2.3.3. Relative proportion of TE in the vicinity of genes

To analyze the TE context in the vicinity of genes, we combined the results of both gene and

TE annotations. Thus, we calculated the average percentages of TEs 20 kb upstream and

downstream of the set of 5,964 filtered CDSs (see section 5.1) present in the pseudomolecule.

Two opposite distribution patterns of TE superfamilies were found. For the three most

abundant superfamilies (gypsy, CACTA, copia), we observed that the abundance increased

with the increasing distance from the genes (Fig. S7). For the other superfamilies, and

particularly for DNA transposons and LINEs, the data indicate that they are enriched in the

10 kb compartments flanking the 5’ or 3’ ends of the CDSs, with a peak around 1 to 5 kb near

the CDSs (Fig. S7).

2.4. TE insertion time

The insertion dates of 21,619 intact LTR retrotransposons were estimated by aligning both

LTRs and using a molecular clock as described in (2). Fig. S5A shows the distribution of

insertion times of LTR-RTs. A peak was observed at 1.51 MYA, i.e. before the

allopolyploidization, confirming previous findings (2). Decomposing this analysis per family

(with more than 100 estimated dates) led us to assume that each family had its own pattern of

activity (Fig. S5B). Amplification bursts could be as recent as 0.6 MYA for the Carmilla

family and as old as 3.2 MYA for the Bare-1 family. In addition, a great variability was also

noticed for the duration of the TE activity. Although most of the families seem to have been

active for a relatively short period of time, some appeared to have been active for several

million years. For example, “Daniela” activity showed a peak spanning about 1 MY, whereas

“Nusif” elements were active over a period of more than 3 MY (Fig. S5).

3. Expression analyses

3.1. Sample preparation and sequencing

Thirty RNA samples were used for expression analyses. They correspond to RNAs extracted

in duplicates from five organs (root, leaf, stem, spike, and grain) at three developmental

stages each from hexaploid wheat cv. Chinese Spring (4) (Table S2). RNA quality was

assessed using an RNA nano Chip on the Agilent Bioanalyzer (Agilent, 2100) and the RNA

integrity number (RIN) was calculated for each sample. Only samples with a RIN higher than

7 were used for library construction. Non-oriented RNA-seq libraries were constructed from 4

µg of total RNA using the Illumina TruSeq™ RNA Sample Preparation Kit (Illumina,

#15008136) according to the manufacturer's protocol, with a library insert fragmentation time

of 12 min. Illumina indexes were used to pool two samples per lane. Libraries were sequenced

on an Illumina HiSeq2000 with 2 x 100-bp paired-end reads. Read quality was checked with

the FastQC v0.10.0 software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

3.2. Read mapping, expression analysis, and detection of alternative splicing

RNA-Seq Illumina reads obtained from the 30 samples (see above) were mapped on the

chromosome 3B scaffolds using TopHat2 v2.0.8 (81) and Bowtie2 (82) with the default

parameters except: 0 mismatch, 0 splice-mismatch. PCR duplicates were removed with

Samtools (83) with the rmdup option and an annotation-guided read alignment was performed

with Cufflinks v2.1.1 (84) to reconstruct transcripts and estimate transcript abundance in units

of fragments per kb of exon per million mapped reads (FPKM, (85)). Regions with FPKM

values higher than zero were considered as expressed. RNA-Seq data revealed the presence of

expressed regions that were not annotated by TriAnnot and were called "novel transcribed

regions" (NTRs). Alternatively spliced (AS) transcripts were identified using Cufflinks (default

parameters; option -g) and all isoforms were analyzed with MATS (86).
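
For reference, the FPKM unit mentioned above corresponds to the following arithmetic. The Python sketch only illustrates the definition of the unit; Cufflinks itself estimates abundances with a statistical model, and the function and argument names here are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kb of exon per million mapped fragments.

    fragments: fragments assigned to the transcript or region.
    exon_length_bp: summed exon length in bp.
    total_mapped_fragments: all mapped fragments in the sample.
    """
    # Dividing by kb (1e3 bp) and by millions (1e6) gives the 1e9 factor.
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)
```

For instance, 100 fragments on a 2 kb transcript in a sample of 10 million mapped fragments give an FPKM of 5, which would count as expressed under the criterion of a value higher than zero.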

3.3. Segmentation/change-point analysis

Segmentation analyses were performed using the R package changepoint v1.0.6 (54) with

Segment Neighbourhoods method and BIC penalty on the mean change. Segmentation was

applied to the distribution (sliding window size: 10 Mb, step: 1 Mb) along the chromosome of

the following features: recombination rate, TE content, gene density, and expression breadth.

4. Chromosome partitioning in barley and maize

4.1. Partitioning of the barley chromosomes

We retrieved publicly available genetic recombination and gene expression data as well as

GO-terms, and gene positions along the 7 barley chromosomes from Mayer et al. (34)

(available for download at http://mips.helmholtz-muenchen.de/plant/barley/download/). We

performed the same analysis as for chromosome 3B by computing the pattern of

recombination rate along the chromosomes in a 10 Mb sliding window with a step of 1 Mb. In

each window, the largest distance between 2 markers and their relative position in cM were

considered to calculate a ratio of cM per Mb. Segmentation analysis was then applied (see

paragraph "segmentation/change-point analysis") in order to identify the borders between the

distal high-recombination and the proximal low-recombination regions. A threshold of 0.40

cM/Mb was used to define high versus low-recombination segments (the average for the

whole genome was 0.25 cM/Mb). The coordinates (in Mb) of the high-recombination

segments are the following: chr1: start-37, 388-end; chr2: start-60, 465-end; chr3: start-21,

466-end; chr4: start-42, 462-end; chr5: start-33, 388-end; chr6: start-27, 488-end; chr7: start-

89, 505-end. For each chromosome, a GO-term enrichment analysis was applied as described

in section "GO term annotation and enrichment analysis". We computed the average

expression breadth (out of 8 tested conditions) in a 10 Mb sliding window with a step of 1 Mb.

The average was calculated only with genes expressed in at least 1 condition. As described by the

authors (34), we used a FPKM threshold of 0.4 to consider that a gene is expressed. We then

performed the segmentation analysis as described above. The coordinates of high/low-

recombination regions were considered to calculate the average expression breadth in high

versus low-recombination regions and a Welch t-test was performed to validate the statistical

significance of the differences observed. The pattern of expression breadth along the 7 barley

chromosomes and the defined segments are represented in Fig. S4A. As observed for

chromosome 3B, we found that high-recombination regions in barley carry genes expressed in

fewer conditions (5.9/8) than low-recombination regions (6.7/8). This difference is

statistically significant (p-value<0.01) for 6 out of the 7 chromosomes (the exception is chromosome

4H).
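
The Welch t-test used to compare expression breadth between the two types of regions relies on the statistic and degrees of freedom sketched below. This is a plain-Python illustration with hypothetical names; it stops short of the p-value, which requires the Student t distribution (available, for example, through scipy.stats.ttest_ind with equal_var=False).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two values, e.g. the expression breadth of
    the genes in high- and low-recombination regions.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```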

4.2. Partitioning of the maize chromosomes

We retrieved publicly available data of the IBM (B73×Mo17) population from Ganal et al.

(33) in order to investigate genetic recombination rate along the 10 maize chromosomes. We

computed the pattern of recombination rate along the chromosomes in a 10 Mb sliding

window with a step of 1 Mb following the same approach described above for barley and

applied a segmentation analysis. A threshold of 1.5 cM/Mb was used to define the borders of

the high-recombination regions (the average for the whole genome was 0.8 cM/Mb). The

coordinates (in Mb) of the high-recombination segments are the following: chr1: start-9, 276-

end; chr2: start-17, 178-end; chr3: start-13, 200-end; chr4: start-11; chr5: start-14; chr6: start-

7, 141-end; chr7: start-10, 153-end; chr8: start-20, 152-end; chr9: start-21, 130-end; chr10:

start-5, 125-end. GO-term enrichment was applied as described above (the maize functional

annotation with GO-terms was retrieved at http://ftp.maizesequence.org/release-4a.53/).

Segmentation based on the average expression breadth of genes was performed using publicly

available expression data in 18 conditions from Sekhon et al. (35). The average expression

breadth (considering expressed genes only) was computed in a 10 Mb sliding window with a

step of 1 Mb and we computed the corresponding segmentation. The coordinates of high/low-

recombination regions were considered to calculate the average expression breadth in these

two types of regions. A Welch t-test was performed to validate the statistical significance of

the differences observed. The pattern of average expression breadth along the 10 maize

chromosomes and the defined segments are represented in Fig. S4B. Contrary to what was

observed for chromosome 3B, in maize, high and low-recombination regions carry genes with

similar expression breadth: 13.2 and 12.7 (/18), respectively.

5. Comparative analyses

5.1. Identification of syntenic and nonsyntenic genes on chromosome 3B

Three fully sequenced grass species were chosen for the comparative analyses: rice (Os;

source: MSU version 7.0), Brachypodium (Bd; source: Brachypodium Sequencing Initiative,

2.0), and sorghum (Sb; source: phytozome, version 1.4). All TEs and alternative splice

variants were removed from the gene sets of all species. Amino acid sequences were used in

an all-by-all BLAST (cutoff e-value of 1e-5) between the proteomes from each of the species

compared. Syntenic genes in each of the species were defined as genes with the reciprocal

best BLAST hit (RBH) on an orthologous chromosome in at least one other species (Ta3B,

Bd2, Os1 or Sb3). The exact borders of the Brachypodium orthologous region on

chromosome 2 were defined by visualizing the reciprocal best BLAST hits (RBHs) with rice

chromosome 1 under the Artemis Comparison Tool (87). Nonsyntenic genes in each species

were defined as genes for which the best BLAST hit was on a non-orthologous chromosome in

the compared species. An extra round of filtration was applied to the gene sets in order to

remove lineage-specific genes and possible mis-annotations. The filtration consisted of

removing all genes from each species with no homology to at least one other gene in a

compared species (at least 35% amino acid identity, and 35% sequence overlap). A Venn

diagram was constructed by counting the number of genes in each species with their best

BLAST hits on orthologous chromosomes in none, one, two, or three other species. In order

to compare the 3B annotation to the barley 3H annotation (34), we searched for similarity

between the 3B genes and the 2,478 gene models anchored on 3H (E-value cutoff: 1e-5). We

then counted all the syntenic or nonsyntenic genes with a 3H hit having at least 35% identity,

and 35% query and hit overlap.
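
The reciprocal best hit (RBH) criterion used in this section can be sketched in Python as follows. The sketch assumes that the best hit of every gene has already been extracted from the BLAST output into two dictionaries; the function name and gene identifiers are hypothetical, and the orthologous-chromosome check described in the text is left to the caller.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    best_a_to_b: dict mapping each gene of species A to its best hit in B.
    best_b_to_a: dict mapping each gene of species B to its best hit in A.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )
```

A usage example with made-up identifiers: with best_a_to_b = {"Ta1": "Os1", "Ta2": "Os2", "Ta3": "Os2"} and best_b_to_a = {"Os1": "Ta1", "Os2": "Ta3"}, the pairs ("Ta1", "Os1") and ("Ta3", "Os2") are retained, while Ta2 has no reciprocal partner.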

5.2. Collinearity

Collinearity was detected between the genes located on chromosomes 3B, Os1, Bd2, and Sb3

using the program MCScanX (88). A collinear block was defined as a conserved set of at least

5 genes (anchors) in the same order between 2 genomes, with a maximum of 25 spacer genes

between the anchors in a collinear block. All the amino acid sequences of the filtered gene

sets of each species were used in an all-by-all BLASTp comparison (using E-value cutoff at

1e-10). Collinearity was visualized using the Circos software (89).

5.3. Intra- and Inter-chromosomal duplications

The percentage of intra-chromosomal duplicates (paralogs) on 3B was determined using the

software OrthoMCL (e-value cutoff: 1e-5, percent match cutoff: 35%) (55). The software

produced clustered families of putative orthologs (homologs between species, originating

from the common ancestor) and paralogs (duplicates within a species) on the basis of

sequence similarity. Therefore, we classified all 3B genes that were clustered into the same

family as intra-chromosomal duplicates, i.e. genes with at least one other member of their family

on the same chromosome. 3B genes clustered in a family with wheat gene models annotated

on another chromosome (19), not including genes from group 3, were considered as inter-

chromosomal duplicates. Tandem duplicates were defined as genes in the same family with 5

or fewer spacer genes separating them on the pseudomolecule, and dispersed duplicates were

defined as having more than 5 spacer genes.
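
The tandem versus dispersed distinction can be illustrated with a short Python sketch. It assumes, as a simplification not stated in the text, that each family member is compared only with the next member of the same family along the pseudomolecule; the function name and input format are hypothetical.

```python
def classify_duplicate_pairs(gene_order, family_members, max_spacers=5):
    """Label neighboring members of one family as tandem or dispersed.

    gene_order: list of gene IDs in their order along the pseudomolecule.
    family_members: gene IDs belonging to the same family.
    Returns (gene_left, gene_right, label) for consecutive family members.
    """
    index = {gene: i for i, gene in enumerate(gene_order)}
    positions = sorted(index[gene] for gene in family_members)
    labels = []
    for left, right in zip(positions, positions[1:]):
        spacers = right - left - 1  # genes lying between the two members
        label = "tandem" if spacers <= max_spacers else "dispersed"
        labels.append((gene_order[left], gene_order[right], label))
    return labels
```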

In order to better understand the nonsyntenic gene origin, we searched for the ancestral locus

for each nonsyntenic gene, i.e. the “parental” gene on another chromosome that has been

duplicated and inserted onto 3B. To do this, we used the gene models defined from the CSS

contigs of the 20 other chromosomes (19). We used the program

detect_collinearity_within_gene_families.pl (part of the MCScanX package) and input the

OrthoMCL-determined families consisting of clusters containing 3B nonsyntenic genes and

rice, Brachypodium, sorghum, and wheat-non group 3 homologs. If the best BLAST hits of a

3B nonsyntenic gene were 1) clustered into the same family, 2) collinear among each other,

and 3) collinear with another CSS gene model in the same family, we considered this CSS

gene model as the parental copy. When several parental genes were detected (for example on

the A, B, and D genomes), the one showing the highest score was chosen as the parent. This

resulted in 152 genes for which a parent gene was defined.

To estimate the exact proportion of nonsyntenic 3B genes that originate from inter-

chromosomal duplication while avoiding underestimation due to annotation problems, we

searched for similarity with the full set of CSS assembled contigs of the 18 non

homoeologous wheat chromosomes (not limited to gene models). By using the parameters of

at least 80% nucleotide identity and at least 100 bp aligned as thresholds, we were able to

estimate that 1,793 of 2,065 (87%) nonsyntenic genes share significant similarity with contigs

from a non homoeologous chromosome and, thus, may originate from inter-chromosomal

duplication. For some of them, no gene model was annotated (partially assembled, split into

several contigs, etc.) and we were not able to detect an annotated parent gene copy using

OrthoMCL clusters only.

5.4. Calculation of synonymous (Ks) and nonsynonymous (Ka) substitution rates

Ka and Ks rates were calculated by, first, removing pseudogenes and then comparing the

coding sequence of a nonsyntenic gene and its parent copy. Alignments were made with

ClustalW version 2.1 (56). Rates were calculated by the Nei and Gojobori method using

codeml (part of the PAML package; (57)). Age of gene divergence was estimated by the

equation Ks/2r, where r=6.5e-9.
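
As a worked example of this equation, the short Python function below converts a Ks value into a divergence time. It assumes that r is expressed in substitutions per synonymous site per year, so that the result is in years; the function name is hypothetical.

```python
def divergence_time_years(ks, rate=6.5e-9):
    """Divergence time T = Ks / (2r), with r = 6.5e-9 as in the text."""
    return ks / (2 * rate)
```

Under this assumption, a gene pair with Ks = 0.013 would have diverged about 1 million years ago (0.013 / 1.3e-8).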

6. Construction of a genetic map and LD mapping

6.1. Plant material and DNA extraction

Deletion mapping of the SNPs was performed using cytogenetic stocks of cv. Chinese Spring

including a nullisomic 3B-tetrasomic 3A line (90), two ditelosomic 3B lines (91), and 14

deletion lines (3BS-3, 3BS-8, 3BS-7, 3BS-9, 3BS-2, 3BS-4, 3BS-1, 3BS-5, 3BL-2, 3BL-8,

3BL-1, 3BL-9, 3BL-10, 3BL-7) (92). The CsRe single seed descent (SSD) population was

derived from a cross between Chinese Spring (Cs) and Renan (Re) using Cs as female parent.

Twelve F1 plants were selfed to produce F2 seeds among which ~1,500 were sown and selfed

to produce F3 families. Fifteen seeds of each family were then sown and the second plant of

each line was systematically selected and selfed to produce F4 seeds. The same procedure was

applied at each generation until F8 families were obtained. The CsRe SSD population consists

of 1,269 individuals, among which a set of 305 was randomly chosen for genetic mapping of

SNPs. DNA was extracted as described in Graner et al. (93) on bulks of leaves collected on 10

F7-seedlings and is thus representative of the F6 generation. Two collections of wheat lines

were used for linkage disequilibrium (LD) mapping: 367 accessions originating from around

the world and representing 98% of the world diversity as estimated using a set of SSR

markers (94). Since this collection exhibits a low structuration, LD values are thus relatively

32

low (95); 353 wheat varieties derived from elite European material in which LD values are

much higher were selected.

6.2. Genetic mapping

A genetic map of chromosome 3B was constructed using 305 individuals selected from the CsRe SSD population. Linkage estimation was based on the maximum likelihood method using CarthaGene (http://www7.inra.fr/mia/T/CarthaGene/) (96) with LOD and θ values of 5 and 0.25, respectively, and the Kosambi mapping function was used to transform recombination fractions into centimorgans. The chromosome 3B consensus map was built following the same strategy as described for the IBM map in maize (97) and for the initial physical map of chromosome 3B (17), using the CsRe SSD genetic map of chromosome 3B described above as a framework map onto which the positions of loci mapped in other populations were extrapolated. The consensus map was constructed using segregation data from the following 40 mapping populations: Chinese Spring x Renan (SSD population) as a framework; the CsRe F2 population used to anchor the initial version of the physical map of chromosome 3B (17); 18 Australian DH, RIL and F2 wheat populations used for DArT marker genotyping (98); a composite wheat map integrating 12 maps (http://wheat.pw.usda.gov/cmap/); the new ITMI population (DH lines) (99); two DH populations, Apache x Balance and Alchemy x Robigus, developed by Florimond-Desprez and RAGT, respectively; the Chinese Spring x Courtot DH population (100); two populations from Agriculture Canada, RL4452 x AC Domain and SC8021V2 x AC Karma (101); and two populations (Avalon x Cadenza and Savannah x Rialto) used to map SNP markers developed in the course of the CerealsDB project (http://www.cerealsdb.uk.net/CerealsDB/SNPS/Documents/DOC_snps.php) (102-104).
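The Kosambi mapping function mentioned above has a simple closed form, d = 25 ln((1 + 2r)/(1 − 2r)) in centimorgans for a recombination fraction r. The short Python sketch below is only an illustration of that formula and its inverse; the function names are hypothetical, and it does not reproduce CarthaGene's internals.

```python
import math


def kosambi_cm(r):
    """Convert a recombination fraction r (0 <= r < 0.5) into a map distance in cM."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm):
    """Inverse transform: map distance in cM back to a recombination fraction."""
    return 0.5 * math.tanh(d_cm / 50.0)


# A recombination fraction of 0.25 (the theta value quoted in the text)
print(round(kosambi_cm(0.25), 2))  # distance in cM
```

For small r the distance is close to 100·r cM, and it grows without bound as r approaches 0.5, reflecting the partial interference assumed by the Kosambi model.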


6.3. Linkage Disequilibrium (LD) mapping

LD mapping aims at ordering markers according to their most likely position on the pseudomolecule, based on the fact that correlation coefficient (r²) values are higher when markers are physically close. The initial order of the markers was based first on deletion-bin mapping and second on genetic mapping. The LD-mapping strategy was applied when genetic mapping failed to find recombinant individuals between two or more markers located in the same genetic bin. For each subset of markers located at the same genetic distance and in the same deletion bin, r² values were computed using Tassel 4.1.32 software (105). The data file was filtered to remove rare alleles (frequency below 5% in the whole population), so that LD values were not biased by low allele frequencies. To make the results clearer, r² values between pairs of markers were calculated in two contrasting populations exhibiting different levels and extents of LD. Markers were then ordered according to their respective r² values using in-house software based on the travelling salesman problem. An advantage of this method is that it also allows the positioning of new markers that are not polymorphic in the mapping populations. Finally, the position of each marker in each block was checked and manually curated when necessary, and LD blocks were numbered according to their most likely position on the genetic map. Markers for which it was not possible to define an optimal order were assigned the same LD block number.
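The ordering principle described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the in-house software used in the study: markers are assumed to be biallelic and coded 0/1 in inbred lines, r² is computed as a squared Pearson correlation, and the travelling-salesman step is replaced by an exhaustive search over open paths, which is only practical for a handful of markers. All names and genotypes are hypothetical.

```python
from itertools import permutations


def r_squared(a, b):
    """Squared Pearson correlation between two markers coded 0/1 per line.

    Assumes both markers are polymorphic (rare alleles already filtered out).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


def order_markers(genotypes):
    """Return the marker order minimising the sum of (1 - r2) between neighbours.

    Exhaustive search over all orders (an open-path travelling salesman problem);
    a heuristic solver would be needed for larger marker sets.
    """
    names = list(genotypes)
    r2 = {(m, n): r_squared(genotypes[m], genotypes[n])
          for m in names for n in names if m != n}

    def cost(order):
        return sum(1.0 - r2[(order[i], order[i + 1])]
                   for i in range(len(order) - 1))

    return min(permutations(names), key=cost)


# Hypothetical genotypes for three co-segregating markers scored in eight lines
example = {
    "mkC": [0, 0, 1, 1, 1, 1, 1, 1],
    "mkA": [0, 0, 0, 0, 1, 1, 1, 1],
    "mkB": [0, 0, 0, 1, 1, 1, 1, 1],
}
print(order_markers(example))
```

The orientation of the returned order is arbitrary, since a path and its reverse have the same cost; in practice the orientation would be fixed by flanking markers with known positions.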

7. metaQTL analysis and projection of the confidence intervals on the 3B pseudomolecule

A survey of the literature and of our own data identified 121 quantitative trait loci (QTLs) for 50 different traits, with r² values ranging from 0.01 to 0.48 and confidence intervals from 5 to 121 cM on chromosome 3B (Table S13). The genetic maps used for QTL detection comprised between 3 and 51 markers, of which 50% to 100% were found on our 3B genetic consensus map. MetaQTL analysis was performed with Biomercator (106), which allowed the projection of 116 of the 121 QTLs onto this map. The most likely metaQTL model (BIC criterion) identified five metaQTLs with confidence intervals ranging from 1.8 to 11.9 cM. Thirteen metaQTLs reported in the literature, with confidence intervals ranging from 0 to 49.6 cM, were added, leading to a total of 18 metaQTLs on chromosome 3B (Table S14). By comparing the sequences of the markers flanking the confidence intervals, each metaQTL interval was assigned to a sequenced region. The 18 metaQTL intervals covered between 1.5 and 620 Mb.


Supplementary tables

Table S1: Features of the different assemblies from chromosome 3B sequences from the first raw assembly up to the pseudomolecule.

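Assembly versions as listed in the table header: v2.1, v3.0, v4.0, v4.1.2, v4.2.2, v4.3.2, v4.4.3.

Contigs

| Feature | Raw assembly (all contigs) | Raw assembly (contigs in scaffolds) | Manual curation of scaffolding | Automated gap filling | Assignment + removal of contamination | Scaffold assembly based on known overlap | Scaffold assembly based on unknown overlap (within each deletion bin) | Scaffold assembly based on overlap predicted by shared TE-junctions | Pseudomolecule |
|---|---|---|---|---|---|---|---|---|---|
| Cumulated size (bp) | 986,092,508 | 855,419,705 | 869,221,962 | 916,084,767 | 915,399,659 | 904,766,638 | 852,223,227 | 773,308,608 | 723,118,112 |
| Number | 294,691 | 130,521 | 149,199 | 54,720 | 54,672 | 53,803 | 49,157 | 43,267 | 37,954 |
| N50 (bp) | 8,590 | 11,920 | 11,887 | 41,163 | 41,076 | 41,356 | 42,475 | 44,324 | 45,448 |
| L50 | 25,939 | 18,085 | 18,420 | 6,250 | 6,264 | 6,142 | 5,639 | 4,923 | 4,509 |
| Average size (bp) | 3,346 | 6,554 | 5,826 | 16,741 | 16,743 | 16,816 | 17,337 | 17,873 | 19,052 |
| Max size (bp) | 163,292 | 163,292 | 163,292 | 512,434 | 512,434 | 512,452 | 512,452 | 512,452 | 512,452 |

Scaffolds

| Feature | Raw assembly | Manual curation of scaffolding | Automated gap filling | Assignment + removal of contamination | Scaffold assembly based on known overlap | Scaffold assembly based on unknown overlap (within each deletion bin) | Scaffold assembly based on overlap predicted by shared TE-junctions | Pseudomolecule |
|---|---|---|---|---|---|---|---|---|
| Cumulated size (bp) | 1,040,382,486 | 995,646,481 | 992,866,407 | 992,338,434 | 980,625,305 | 920,921,491 | 832,800,924 | 774,434,471 |
| Number | 16,136 | 5,095 | 5,109 | 4,999 | 4,747 | 3,769 | 2,808 | 1,358 |
| N50 (bp) | 274,769 | 464,172 | 462,955 | 462,955 | 494,575 | 639,215 | 892,435 | 949,321 |
| L50 | 1,048 | 682 | 683 | 683 | 606 | 450 | 296 | 264 |
| Average size (bp) | 64,476 | 195,416 | 194,337 | 198,507 | 206,578 | 244,341 | 296,582 | 570,176 |
| Max size (bp) | 1,338,329 | 1,648,264 | 1,638,993 | 1,638,993 | 2,795,397 | 3,884,599 | 4,169,843 | 4,169,843 |
| Amount of Ns | 184,963,128 | 126,424,892 | 76,845,121 | 77,002,074 | 75,921,345 | 68,757,059 | 59,545,911 | 51,365,845 |
| % of Ns | 17.8% | 12.7% | 7.7% | 7.8% | 7.7% | 7.5% | 7.2% | 6.6% |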
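For readers unfamiliar with the N50 and L50 statistics reported in Table S1, the following Python sketch shows the usual way of computing them from a list of sequence lengths. It is a generic illustration with hypothetical lengths, not the script used to produce the table.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least half of the total assembly size;
    L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half = sum(ordered) / 2.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Hypothetical lengths (bp)
print(n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 together with a lower L50, as seen from the raw assembly to the pseudomolecule, indicates that the same amount of sequence is carried by fewer, longer pieces.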


Table S2: Table of the Zadoks decimal code for wheat growth stages (107). Stages with a

cross were used for the RNA-Seq analysis

| Stage | Wheat growth stage | Feekes scale | Zadoks scale | Organs sampled (X marks; organ columns: Leaves, Root, Stem, Spike, Grain) |
|---|---|---|---|---|
| Seedling | First leaf through coleoptile | 1 | 10 | X X |
| Three leaves | 3 leaves unfolded | | 13 | X |
| Three tillers | Main shoot and 3 tillers | | 23 | X |
| Spike at 1 cm | Pseudostem erection | 5 | 30 | X |
| Two nodes | 2nd detectable node | 7 | 32 | X X |
| Meiosis | Flag leaf ligule and collar visible | 9 | 39 | X X |
| Anthesis | 1/2 of flowering complete | | 65 | X X |
| 2 DAAs (50°C.days) | Kernel (caryopsis) watery ripe | | 71 | X X |
| 14 DAAs (350°C.days) | Medium Milk | | 75 | X |
| 30 DAAs (700°C.days) | Soft dough | | 85 | X |


Table S3: Overrepresented R1 GO terms. The table is divided into three sections: Cellular

component (C), molecular function (F), and biological process (P). The Depth is the depth in

the GO hierarchy. GO term redundancy was removed by GO trimming. Counts of genes and

adjusted p-values are shown for each segment (R1, R2, R3). Significant p-values are in bold.

| Term Name | GOID | Term Type | Depth | # genes (R1) | adjusted R1 p-value | # genes (R2) | adjusted R2 p-value | # genes (R3) | adjusted R3 p-value |
|---|---|---|---|---|---|---|---|---|---|
| membrane-bounded vesicle | GO:0031988 | C | 4 | 267 | 1.01E-12 | 500 | 1.00E+00 | 169 | 2.08E-01 |
| vesicle | GO:0031982 | C | 3 | 267 | 1.11E-12 | 501 | 1.00E+00 | 169 | 2.08E-01 |
| cytoplasmic vesicle | GO:0031410 | C | 4 | 267 | 1.11E-12 | 501 | 1.00E+00 | 169 | 2.08E-01 |
| intracellular membrane-bounded organelle | GO:0043231 | C | 4 | 510 | 6.08E-08 | 1369 | 1.00E+00 | 326 | 1.00E+00 |
| organelle | GO:0043226 | C | 2 | 525 | 1.09E-06 | 1456 | 1.00E+00 | 339 | 1.00E+00 |
| intracellular organelle | GO:0043229 | C | 3 | 525 | 1.09E-06 | 1456 | 1.00E+00 | 339 | 1.00E+00 |
| intracellular part | GO:0044424 | C | 3 | 552 | 1.39E-06 | 1563 | 1.00E+00 | 348 | 1.00E+00 |
| ubiquitin ligase complex | GO:0000151 | C | 4 | 8 | 2.20E-02 | 5 | 1.00E+00 | 0 | 1.00E+00 |
| small conjugating protein ligase activity | GO:0019787 | F | 6 | 75 | 2.32E-30 | 23 | 1.00E+00 | 5 | 1.00E+00 |
| protein kinase activity | GO:0004672 | F | 6 | 170 | 3.43E-20 | 223 | 1.00E+00 | 56 | 1.00E+00 |
| transferase activity | GO:0016740 | F | 3 | 240 | 8.97E-11 | 493 | 1.00E+00 | 115 | 1.00E+00 |
| adenyl ribonucleotide binding | GO:0032559 | F | 6 | 227 | 1.47E-07 | 460 | 1.00E+00 | 161 | 4.82E-02 |
| purine nucleoside binding | GO:0001883 | F | 4 | 233 | 5.32E-07 | 485 | 1.00E+00 | 169 | 4.14E-02 |
| enzyme regulator activity | GO:0030234 | F | 2 | 38 | 1.39E-06 | 41 | 1.00E+00 | 5 | 1.00E+00 |
| catalytic activity | GO:0003824 | F | 2 | 516 | 1.43E-05 | 1495 | 1.00E+00 | 300 | 1.00E+00 |
| carboxylesterase activity | GO:0004091 | F | 5 | 23 | 1.28E-03 | 25 | 1.00E+00 | 5 | 1.00E+00 |
| endopeptidase inhibitor activity | GO:0004866 | F | 5 | 16 | 3.63E-03 | 17 | 1.00E+00 | 0 | 1.00E+00 |
| binding | GO:0005488 | F | 2 | 572 | 2.10E-02 | 1737 | 1.00E+00 | 421 | 1.00E+00 |
| post-translational protein modification | GO:0043687 | P | 6 | 243 | 1.55E-38 | 266 | 1.00E+00 | 67 | 1.00E+00 |
| protein modification process | GO:0006464 | P | 5 | 249 | 4.60E-37 | 289 | 1.00E+00 | 71 | 1.00E+00 |
| modification-dependent protein catabolic process | GO:0019941 | P | 7 | 74 | 4.20E-32 | 17 | 1.00E+00 | 6 | 1.00E+00 |
| cellular protein metabolic process | GO:0044267 | P | 5 | 285 | 1.54E-29 | 429 | 1.00E+00 | 97 | 1.00E+00 |
| protein modification by small protein conjugation | GO:0032446 | P | 7 | 71 | 1.81E-29 | 20 | 1.00E+00 | 5 | 1.00E+00 |
| cellular catabolic process | GO:0044248 | P | 4 | 87 | 4.83E-23 | 58 | 1.00E+00 | 10 | 1.00E+00 |
| developmental process | GO:0032502 | P | 2 | 83 | 1.84E-22 | 51 | 1.00E+00 | 12 | 1.00E+00 |
| protein metabolic process | GO:0019538 | P | 4 | 308 | 3.79E-22 | 557 | 1.00E+00 | 120 | 1.00E+00 |
| multicellular organismal process | GO:0032501 | P | 2 | 86 | 9.76E-20 | 62 | 1.00E+00 | 18 | 1.00E+00 |
| cellular macromolecule metabolic process | GO:0044260 | P | 4 | 394 | 7.54E-19 | 866 | 1.00E+00 | 155 | 1.00E+00 |
| phosphorylation | GO:0016310 | P | 6 | 174 | 1.56E-14 | 286 | 1.00E+00 | 58 | 1.00E+00 |
| cellular metabolic process | GO:0044237 | P | 3 | 462 | 5.28E-13 | 1190 | 1.00E+00 | 193 | 1.00E+00 |
| primary metabolic process | GO:0044238 | P | 3 | 493 | 3.45E-10 | 1331 | 1.00E+00 | 238 | 1.00E+00 |
| lipid localization | GO:0010876 | P | 4 | 17 | 9.79E-06 | 9 | 1.00E+00 | 0 | 1.00E+00 |
| DNA metabolic process | GO:0006259 | P | 5 | 43 | 1.25E-02 | 74 | 1.00E+00 | 21 | 1.00E+00 |
| cellular amino acid metabolic process | GO:0006520 | P | 4 | 29 | 1.30E-02 | 47 | 1.00E+00 | 7 | 1.00E+00 |
| cellular amine metabolic process | GO:0044106 | P | 5 | 29 | 1.60E-02 | 48 | 1.00E+00 | 7 | 1.00E+00 |
| developmental maturation | GO:0021700 | P | 3 | 12 | 2.45E-02 | 13 | 1.00E+00 | 0 | 1.00E+00 |
| aromatic amino acid family metabolic process | GO:0009072 | P | 5 | 5 | 2.98E-02 | 1 | 1.00E+00 | 0 | 1.00E+00 |
| L-phenylalanine metabolic process | GO:0006558 | P | 6 | 4 | 2.99E-02 | 0 | 1.00E+00 | 0 | 1.00E+00 |
| aromatic amino acid family catabolic process | GO:0009074 | P | 6 | 4 | 2.99E-02 | 0 | 1.00E+00 | 0 | 1.00E+00 |
| response to abiotic stimulus | GO:0009628 | P | 3 | 19 | 3.34E-02 | 30 | 1.00E+00 | 1 | 1.00E+00 |




Table S4: Overrepresented R2 GO terms. Counts of genes and adjusted p-values are shown for each segment (R1, R2, R3). Significant p-values are in bold.

| Term Name | GOID | Term Type | Depth | # genes (R1) | adjusted R1 p-value | # genes (R2) | adjusted R2 p-value | # genes (R3) | adjusted R3 p-value |
|---|---|---|---|---|---|---|---|---|---|
| macromolecular complex | GO:0032991 | C | 2 | 38 | 1.00E+00 | 254 | 1.60E-05 | 32 | 1.00E+00 |
| protein complex | GO:0043234 | C | 3 | 25 | 1.00E+00 | 155 | 1.13E-04 | 12 | 1.00E+00 |
| membrane part | GO:0044425 | C | 3 | 29 | 1.00E+00 | 194 | 3.06E-04 | 25 | 1.00E+00 |
| intrinsic to membrane | GO:0031224 | C | 4 | 17 | 1.00E+00 | 134 | 2.36E-03 | 19 | 1.00E+00 |
| proton-transporting ATP synthase complex | GO:0045259 | C | 4 | 1 | 1.00E+00 | 22 | 1.41E-02 | 0 | 1.00E+00 |
| ribonucleoprotein complex | GO:0030529 | C | 3 | 10 | 1.00E+00 | 83 | 4.54E-02 | 13 | 1.00E+00 |
| nucleoside-triphosphatase activity | GO:0017111 | F | 7 | 10 | 1.00E+00 | 144 | 9.80E-08 | 11 | 1.00E+00 |
| hydrolase activity | GO:0016787 | F | 3 | 93 | 1.00E+00 | 586 | 1.23E-07 | 101 | 1.00E+00 |
| hydrolase activity, acting on acid anhydrides | GO:0016817 | F | 4 | 13 | 1.00E+00 | 174 | 7.63E-07 | 21 | 1.00E+00 |
| ATPase activity, coupled | GO:0042623 | F | 9 | 3 | 1.00E+00 | 76 | 3.94E-06 | 4 | 1.00E+00 |
| exonuclease activity | GO:0004527 | F | 6 | 2 | 1.00E+00 | 56 | 1.13E-04 | 3 | 1.00E+00 |
| P-P-bond-hydrolysis-driven transmembrane transporter activity | GO:0015405 | F | 6 | 2 | 1.00E+00 | 50 | 5.38E-04 | 3 | 1.00E+00 |
| nuclease activity | GO:0004518 | F | 5 | 8 | 1.00E+00 | 80 | 6.37E-04 | 6 | 1.00E+00 |
| hydrolase activity, acting on acid anhydrides, catalyzing transmembrane movement of substances | GO:0016820 | F | 4 | 2 | 1.00E+00 | 47 | 1.25E-03 | 3 | 1.00E+00 |
| ATPase activity, coupled to movement of substances | GO:0043492 | F | 10 | 2 | 1.00E+00 | 47 | 1.25E-03 | 3 | 1.00E+00 |
| helicase activity | GO:0004386 | F | 8 | 3 | 1.00E+00 | 38 | 1.47E-03 | 0 | 1.00E+00 |
| nucleic acid binding | GO:0003676 | F | 3 | 85 | 1.00E+00 | 422 | 2.34E-03 | 75 | 1.00E+00 |
| purine NTP-dependent helicase activity | GO:0070035 | F | 9 | 1 | 1.00E+00 | 26 | 3.56E-03 | 0 | 1.00E+00 |
| monovalent inorganic cation transmembrane transporter activity | GO:0015077 | F | 8 | 7 | 1.00E+00 | 44 | 1.38E-02 | 0 | 1.00E+00 |
| transcription regulator activity | GO:0030528 | F | 2 | 28 | 1.00E+00 | 142 | 1.43E-02 | 16 | 1.00E+00 |
| antioxidant activity | GO:0016209 | F | 2 | 2 | 1.00E+00 | 45 | 2.31E-02 | 6 | 1.00E+00 |
| oxidoreductase activity, acting on peroxide as acceptor | GO:0016684 | F | 4 | 2 | 1.00E+00 | 45 | 2.31E-02 | 6 | 1.00E+00 |
| cation transmembrane transporter activity | GO:0008324 | F | 6 | 8 | 1.00E+00 | 62 | 2.55E-02 | 6 | 1.00E+00 |
| cellular biosynthetic process | GO:0044249 | P | 4 | 105 | 1.00E+00 | 540 | 9.80E-08 | 67 | 1.00E+00 |
| cellular macromolecule biosynthetic process | GO:0034645 | P | 5 | 88 | 1.00E+00 | 435 | 3.59E-06 | 52 | 1.00E+00 |
| gene expression | GO:0010467 | P | 4 | 85 | 1.00E+00 | 418 | 1.60E-05 | 53 | 1.00E+00 |
| regulation of cellular process | GO:0050794 | P | 3 | 65 | 1.00E+00 | 344 | 2.20E-05 | 43 | 1.00E+00 |
| regulation of transcription, DNA-dependent | GO:0006355 | P | 6 | 56 | 1.00E+00 | 267 | 2.46E-04 | 27 | 1.00E+00 |
| generation of precursor metabolites and energy | GO:0006091 | P | 4 | 19 | 1.00E+00 | 112 | 1.24E-03 | 7 | 1.00E+00 |
| purine ribonucleotide metabolic process | GO:0009150 | P | 7 | 1 | 1.00E+00 | 43 | 1.25E-03 | 3 | 1.00E+00 |
| energy coupled proton transport, down electrochemical gradient | GO:0015985 | P | 5 | 1 | 1.00E+00 | 29 | 1.47E-03 | 0 | 1.00E+00 |
| ATP biosynthetic process | GO:0006754 | P | 10 | 1 | 1.00E+00 | 29 | 1.47E-03 | 0 | 1.00E+00 |
| purine ribonucleoside triphosphate metabolic process | GO:0009205 | P | 8 | 1 | 1.00E+00 | 40 | 2.57E-03 | 3 | 1.00E+00 |
| cellular localization | GO:0051641 | P | 3 | 5 | 1.00E+00 | 63 | 3.10E-03 | 6 | 1.00E+00 |
| purine-containing compound biosynthetic process | GO:0072522 | P | 6 | 1 | 1.00E+00 | 39 | 3.33E-03 | 3 | 1.00E+00 |
| nucleobase, nucleoside, nucleotide and nucleic acid metabolic process | GO:0006139 | P | 4 | 116 | 1.00E+00 | 464 | 3.57E-03 | 66 | 1.00E+00 |
| cation transport | GO:0006812 | P | 5 | 5 | 1.00E+00 | 55 | 4.24E-03 | 4 | 1.00E+00 |
| oxidative phosphorylation | GO:0006119 | P | 5 | 7 | 1.00E+00 | 54 | 5.48E-03 | 2 | 1.00E+00 |
| signaling | GO:0023052 | P | 2 | 7 | 1.00E+00 | 80 | 6.22E-03 | 11 | 1.00E+00 |
| protein transport | GO:0015031 | P | 4 | 4 | 1.00E+00 | 52 | 8.97E-03 | 5 | 1.00E+00 |
| cellular protein localization | GO:0034613 | P | 5 | 4 | 1.00E+00 | 52 | 8.97E-03 | 5 | 1.00E+00 |
| protein localization | GO:0008104 | P | 4 | 6 | 1.00E+00 | 58 | 9.11E-03 | 5 | 1.00E+00 |
| intracellular transport | GO:0046907 | P | 4 | 5 | 1.00E+00 | 55 | 1.84E-02 | 6 | 1.00E+00 |
| cellular response to stimulus | GO:0051716 | P | 3 | 10 | 1.00E+00 | 85 | 1.95E-02 | 12 | 1.00E+00 |
| intracellular signal transduction | GO:0035556 | P | 4 | 4 | 1.00E+00 | 42 | 2.30E-02 | 3 | 1.00E+00 |
| cellular component organization or biogenesis | GO:0071840 | P | 2 | 14 | 1.00E+00 | 111 | 4.08E-02 | 20 | 1.00E+00 |
| nucleic acid metabolic process | GO:0090304 | P | 5 | 112 | 1.00E+00 | 402 | 4.75E-02 | 54 | 1.00E+00 |
| nucleotide biosynthetic process | GO:0009165 | P | 6 | 1 | 1.00E+00 | 42 | 4.86E-02 | 7 | 1.00E+00 |
| protein complex subunit organization | GO:0071822 | P | 5 | 2 | 1.00E+00 | 26 | 4.90E-02 | 1 | 1.00E+00 |





Table S5: Overrepresented R3 GO terms. Counts of genes and adjusted p-values are shown for each segment (R1, R2, R3). Significant p-values are in bold.

| Term Name | GOID | Term Type | Depth | # genes (R1) | adjusted R1 p-value | # genes (R2) | adjusted R2 p-value | # genes (R3) | adjusted R3 p-value |
|---|---|---|---|---|---|---|---|---|---|
| purine nucleoside binding | GO:0001883 | F | 4 | 233 | 5.32E-07 | 485 | 1.00E+00 | 169 | 4.14E-02 |
| secondary active transmembrane transporter activity | GO:0015291 | F | 5 | 10 | 1.00E+00 | 10 | 1.00E+00 | 17 | 2.01E-03 |
| adenyl ribonucleotide binding | GO:0032559 | F | 6 | 227 | 1.47E-07 | 460 | 1.00E+00 | 161 | 4.82E-02 |
| response to stimulus | GO:0050896 | P | 2 | 66 | 1.00E+00 | 239 | 1.00E+00 | 125 | 2.13E-12 |
| response to stress | GO:0006950 | P | 3 | 47 | 1.00E+00 | 150 | 1.00E+00 | 101 | 2.10E-14 |
| programmed cell death | GO:0012501 | P | 4 | 34 | 1.00E+00 | 40 | 1.00E+00 | 85 | 1.48E-27 |
| drug transport | GO:0015893 | P | 4 | 9 | 8.90E-01 | 5 | 1.00E+00 | 14 | 2.91E-03 |
| sphingoid metabolic process | GO:0046519 | P | 7 | 0 | 1.00E+00 | 1 | 1.00E+00 | 5 | 4.66E-02 |


Table S6: Number and proportion of filtered syntenic and nonsyntenic genes in the

pseudomolecule of wheat chromosome 3B, and the orthologous chromosomes in

Brachypodium, rice, and sorghum. Syntenic genes have their best BLAST hit on an

orthologous chromosome in at least one compared species.

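| Species (chromosome) | Syntenic genes (count) | Syntenic genes (%) | Nonsyntenic genes (count) | Nonsyntenic genes (%) |
|---|---|---|---|---|
| T. aestivum (3B) | 3899 | 65.4% | 2065 | 34.6% |
| B. distachyon (2) | 3310 | 95.7% | 149 | 4.3% |
| O. sativa (1) | 3601 | 94.8% | 196 | 5.2% |
| S. bicolor (3) | 3449 | 94.3% | 207 | 5.7% |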

Table S7: Proportion of collinear genes between orthologous chromosomes of four grass

species. The values indicate the percent of genes in species 1 that are collinear with the

genome of species 2 (tae: T. aestivum; bdi: B. distachyon; osa: O. sativa; sbi: S. bicolor).

| Species 1 \ Species 2 | tae3B | bdi2 | osa1 | sbi3 |
|---|---|---|---|---|
| tae3B | | 50.7% | 42.2% | 41.9% |
| bdi2 | 29.4% | | 62.5% | 63.0% |
| osa1 | 26.9% | 68.6% | | 68.1% |
| sbi3 | 25.7% | 66.6% | 65.6% | |


Table S8: Proportion of expressed genes on the 3B pseudomolecule and expression breadth. Percentages of expressed genes are indicated for syntenic genes, nonsyntenic genes, singletons (i.e., genes with no evidence of intra-chromosomal duplication origin), and genes potentially originating from intra-chromosomal duplication.

| | Syntenic | Nonsyntenic | Singletons | Intra-chr. duplicates |
|---|---|---|---|---|
| % genes expressed in at least 1 condition (/15) | 81.9% | 68.6% | 83.6% | 61.6% |
| Average number of conditions (/15) with expression evidence* | 12.0±4.6 | 9.2±5.5 | 11.8±4.7 | 9.4±5.4 |
| % pseudogenes/gene fragments | 17.2% | 31.6% | 22.0% | 22.6% |

*Out of the expressed genes


Table S9: GO term enrichment analysis between syntenic and nonsyntenic genes. The table is

divided into three sections: Cellular component (C), molecular function (F), and biological

process (P). The Depth is the depth in the GO hierarchy. GO term redundancy was removed

by GO trimming. Counts of genes (#) and adjusted p-values are shown for syntenic and

nonsyntenic genes. Significant p-values are in bold.

| Term Name | GOID | Term Type | Depth | # all genes | # nonsyntenic genes | nonsyntenic p-value | # syntenic genes | syntenic p-value |
|---|---|---|---|---|---|---|---|---|
| developmental process | GO:0032502 | P | 2 | 146 | 16 | 1.0000 | 109 | 0.0000 |
| modification-dependent protein catabolic process | GO:0019941 | P | 7 | 97 | 13 | 1.0000 | 77 | 0.0000 |
| cellular catabolic process | GO:0044248 | P | 4 | 155 | 22 | 1.0000 | 114 | 0.0000 |
| multicellular organismal process | GO:0032501 | P | 2 | 166 | 23 | 1.0000 | 120 | 0.0000 |
| protein modification by small protein conjugation | GO:0032446 | P | 7 | 96 | 15 | 1.0000 | 75 | 0.0000 |
| regulation of biological process | GO:0050789 | P | 2 | 510 | 131 | 1.0000 | 307 | 0.0068 |
| nucleic acid metabolic process | GO:0090304 | P | 5 | 568 | 134 | 1.0000 | 339 | 0.0131 |
| regulation of metabolic process | GO:0019222 | P | 3 | 422 | 100 | 1.0000 | 255 | 0.0251 |
| cellular macromolecule metabolic process | GO:0044260 | P | 4 | 1415 | 443 | 1.0000 | 795 | 0.0422 |
| regulation of cellular process | GO:0050794 | P | 3 | 452 | 114 | 1.0000 | 270 | 0.0463 |
| programmed cell death | GO:0012501 | P | 4 | 159 | 103 | 0.0000 | 42 | 1.0000 |
| lipid localization | GO:0010876 | P | 4 | 26 | 21 | 0.0000 | 4 | 1.0000 |
| chromatin assembly | GO:0031497 | P | 5 | 25 | 20 | 0.0000 | 5 | 1.0000 |
| nucleosome organization | GO:0034728 | P | 6 | 25 | 20 | 0.0000 | 5 | 1.0000 |
| protein-DNA complex assembly | GO:0065004 | P | 6 | 25 | 20 | 0.0000 | 5 | 1.0000 |
| protein-DNA complex subunit organization | GO:0071824 | P | 5 | 25 | 20 | 0.0000 | 5 | 1.0000 |
| response to stress | GO:0006950 | P | 3 | 298 | 137 | 0.0000 | 113 | 1.0000 |
| chromatin organization | GO:0006325 | P | 7 | 42 | 27 | 0.0038 | 14 | 1.0000 |
| macromolecular complex assembly | GO:0065003 | P | 5 | 53 | 32 | 0.0038 | 17 | 1.0000 |
| cellular component assembly at cellular level | GO:0071844 | P | 5 | 37 | 24 | 0.0038 | 10 | 1.0000 |
| respiratory electron transport chain | GO:0022904 | P | 5 | 33 | 22 | 0.0038 | 0 | 1.0000 |
| oxidative phosphorylation | GO:0006119 | P | 5 | 63 | 36 | 0.0038 | 13 | 1.0000 |
| phosphorylation | GO:0016310 | P | 6 | 518 | 214 | 0.0063 | 261 | 1.0000 |
| cellular component assembly | GO:0022607 | P | 4 | 55 | 32 | 0.0063 | 18 | 1.0000 |
| macromolecular complex subunit organization | GO:0043933 | P | 4 | 55 | 32 | 0.0063 | 19 | 1.0000 |
| photosynthesis, light reaction | GO:0019684 | P | 5 | 15 | 12 | 0.0089 | 2 | 1.0000 |
| cellular component organization at cellular level | GO:0071842 | P | 4 | 78 | 40 | 0.0306 | 32 | 1.0000 |
| cellular component organization | GO:0016043 | P | 3 | 105 | 51 | 0.0326 | 45 | 1.0000 |
| viral infectious cycle | GO:0019058 | P | 4 | 6 | 6 | 0.0353 | 0 | 1.0000 |
| transmembrane transport | GO:0055085 | P | 3 | 59 | 31 | 0.0570 | 18 | 1.0000 |
| intracellular membrane-bounded organelle | GO:0043231 | C | 4 | 2205 | 678 | 1.0000 | 1236 | 0.0000 |
| intracellular non-membrane-bounded organelle | GO:0043232 | C | 4 | 154 | 83 | 0.0000 | 55 | 1.0000 |
| extracellular region | GO:0005576 | C | 2 | 42 | 30 | 0.0000 | 12 | 1.0000 |
| ribosome | GO:0005840 | C | 4 | 34 | 25 | 0.0000 | 5 | 1.0000 |
| protein-DNA complex | GO:0032993 | C | 3 | 25 | 20 | 0.0000 | 5 | 1.0000 |
| chromatin | GO:0000785 | C | 5 | 27 | 21 | 0.0000 | 5 | 1.0000 |
| intracellular organelle part | GO:0044446 | C | 3 | 189 | 91 | 0.0000 | 73 | 1.0000 |
| macromolecular complex | GO:0032991 | C | 2 | 324 | 141 | 0.0063 | 134 | 1.0000 |
| ribonucleoprotein complex | GO:0030529 | C | 3 | 106 | 54 | 0.0063 | 37 | 1.0000 |
| external encapsulating structure | GO:0030312 | C | 3 | 21 | 15 | 0.0145 | 6 | 1.0000 |
| small conjugating protein ligase activity | GO:0019787 | F | 6 | 103 | 17 | 1.0000 | 77 | 0.0000 |
| tetrapyrrole binding | GO:0046906 | F | 3 | 145 | 82 | 0.0000 | 49 | 1.0000 |
| purine nucleoside binding | GO:0001883 | F | 4 | 887 | 369 | 0.0000 | 427 | 1.0000 |
| iron ion binding | GO:0005506 | F | 7 | 157 | 84 | 0.0000 | 56 | 1.0000 |
| adenyl ribonucleotide binding | GO:0032559 | F | 6 | 848 | 352 | 0.0000 | 411 | 1.0000 |
| structural molecule activity | GO:0005198 | F | 2 | 98 | 52 | 0.0038 | 33 | 1.0000 |
| oxidoreductase activity | GO:0016491 | F | 3 | 310 | 135 | 0.0063 | 119 | 1.0000 |
| NADH dehydrogenase (quinone) activity | GO:0050136 | F | 6 | 39 | 23 | 0.0306 | 6 | 1.0000 |


Table S10: Inter-chromosomal duplicates in wheat chromosome 3B and related species.


| Species (chromosome) | # gene families* | # duplicated gene families | # genes (interdup) | % inter-dup (% of total families) | % inter-dup (% of total genes) |
|---|---|---|---|---|---|
| T. aestivum (3B) | 3949 | 1321 | 2032 | 33.4% | 34.1% |
| B. distachyon (2) | 3008 | 670 | 841 | 22.3% | 24.3% |
| O. sativa (1) | 3053 | 659 | 859 | 21.6% | 22.6% |
| S. bicolor (3) | 3087 | 646 | 824 | 20.9% | 22.5% |

*OrthoMCL clusters of paralogs or orthologs when only considering the genes from wheat 3B, rice 1, B. distachyon 2, and sorghum 3.

Table S11: Intra-chromosomal duplicates in wheat chromosome 3B and related species.


| Species (chromosome) | # gene families* | # duplicated gene families | # duplicated genes | % duplicates (out of total # families) | % duplicates (out of total # genes) | Average (standard deviation) number of duplicates per family | Maximum # duplicates per family |
|---|---|---|---|---|---|---|---|
| T. aestivum (3B) | 3949 | 809 | 2216 | 20.5% | 37.2% | 2.7±2.0 | 31 |
| B. distachyon (2) | 3008 | 215 | 529 | 7.1% | 15.3% | 2.5±1.1 | 13 |
| O. sativa (1) | 3053 | 242 | 669 | 7.9% | 17.6% | 2.8±1.9 | 17 |
| S. bicolor (3) | 3087 | 241 | 592 | 7.8% | 16.2% | 2.5±1.2 | 15 |

Table S12: Tandem and dispersed duplicates in wheat chromosome 3B and related species.


| Species (chromosome) | # tandem duplicates | % tandem duplicates | % tandem duplicates (out of # intrachr. duplicates) | # dispersed duplicates | % dispersed duplicates | % dispersed duplicates (out of # intrachr. duplicates) |
|---|---|---|---|---|---|---|
| T. aestivum (3B) | 1022 | 17.1% | 46.1% | 1194 | 20.0% | 53.9% |
| B. distachyon (2) | 293 | 8.5% | 55.4% | 236 | 6.8% | 44.6% |
| O. sativa (1) | 441 | 11.6% | 65.9% | 228 | 6.0% | 34.1% |
| S. bicolor (3) | 324 | 8.9% | 54.7% | 268 | 7.3% | 45.3% |


Table S13: Characteristics of the QTL studies used in the metaQTL analysis.

| Publication | Population | Population type | Population size | # markers | # QTL |
|---|---|---|---|---|---|
| Groos et al. 2003 (108) | Renan / Recital | SSD F7 | 194 | 14 | 17 |
| An et al. 2006 (109) | Hanxuan10 / Lumai14 | DH | 120 | 29 | 2 |
| Habash et al. 2007 (110) | Chinese Spring / SQ1 | DH | 91 | 51 | 5 |
| Laperche et al. 2007 (111) | Arche / Recital | DH | 220 | 7 | 13 |
| Li et al. 2007 (112)* | Chuan 35050 / Shannong 483 | SSD F14 | 131 | 32 | 6 |
| Fontaine et al. 2009 (113) | Arche / Recital | DH | 137 / 166 / 221 | 16 | 8 |
| Zhang et al. 2011 (114) | PH82-2 / Neixiang 188 | SSD F6 | 240 | 7 | 3 |
| Bennett et al. 2012 (115) | Kukri / RAC875 | DH | 260 / 180 | 41 | 10 |
| Bogard et al. 2011 (116) | Toisondor / CF9107 | DH | 140 | 32 | 11 |
| Guo et al. 2012 (117) | Chuan 35050 / Shannong 483 | SSD F16 | 131 | 45 | 14 |
| Bogard et al. 2013 (118) | Toisondor / Quebon; CF9107 / Quebon; Toisondor / CF9107 | DH; DH; DH | 140; 91; 90 | 46 | 15 |
| Liu et al. 2013 (119) | Hanxuan10 x Lumai14 | DH | 150 | 29 | 11 |
| J. Le Gouis et al. unpub. | Apache / Ornicar | DH | 235 | 3 | 5 |
| J. Le Gouis et al. unpub. | Apache / Isengrain; CF9107 / Apache; CF9107 / Isengrain | DH; DH; DH | 83; 161; 83 | 21 | 1 |

SSD = Single Seed Descent, DH = Doubled-Haploid

*QTLs from Li et al. (2007) were first projected onto the map of Guo et al. (117) after removing three inconsistent markers (Xtrap4c, Xwmc291, Xwmc3) from the Li et al. map (112). The resulting map and QTLs were used in the meta-analysis.


Table S14: Correspondence between the 18 metaQTLs, the 3B sequence, the number of annotated genes and the number of markers. The numbers of ISBP and SSR markers (designed in silico on the 3B sequence) are indicated only for the 5 metaQTLs showing a confidence interval smaller than 10 Mb. CI = confidence interval.

| Publication | metaQTL ID | MetaQTL position (cM) | CI start on 3B consensus map (cM) | CI end on 3B consensus map (cM) | CI size on 3B consensus map (cM) | CI start on 3B pseudomolecule (Mb) | CI end on 3B pseudomolecule (Mb) | CI size on 3B pseudomolecule (Mb) | no. Genes | no. ISBP markers | no. SSR markers |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Griffiths et al. 2009 (120) | 3B-1 | 12.97 | 1.60 | 24.34 | 22.74 | 4.1 | 41.3 | 37.2 | 809 | | |
| Griffiths et al. 2009 (120) | 3B-2 | 50.47 | 44.80 | 56.14 | 11.34 | 256.2 | 632.3 | 376.1 | 2,372 | | |
| Mao et al. 2010 (121) | P9 | 10.80 | 7.97 | 13.63 | 5.66 | 21.4 | 28.2 | 6.7 | 128 | 1,505 | 719 |
| Mao et al. 2010 (121) | F5 | 2.71 | 2.54 | 2.88 | 0.34 | 5.2 | 9.4 | 4.2 | 87 | 1,188 | 459 |
| Mao et al. 2010 (121) | F6 | 25.83 | 13.20 | 38.45 | 25.25 | 26.1 | 66.4 | 40.3 | 650 | | |
| Zhang et al. 2010 (122) | MQTL24 | 1.22 | -0.53 | 2.96 | 3.49 | 0.0 | 9.6 | 9.6 | 266 | 2,754 | 1,295 |
| Zhang et al. 2010 (122) | MQTL25 | 30.22 | 17.91 | 42.52 | 24.61 | 34.7 | 135.9 | 101.2 | 1,152 | | |
| Zhang et al. 2010 (122) | MQTL26 | 44.80 | 44.80 | 44.80 | 0.00 | 223.5 | 388.3 | 164.8 | 690 | | |
| Zhang et al. 2010 (122) | MQTL27 | 23.03 | 1.26 | 44.80 | 43.54 | 1.9 | 388.3 | 386.4 | 3,223 | | |
| Zhang et al. 2010 (122) | MQTL28 | 66.85 | 42.07 | 91.63 | 49.56 | 116.8 | 737.0 | 620.3 | 4,713 | | |
| Zhang et al. 2010 (122) | MQTL29 | 105.31 | 106.60 | 104.02 | -2.58 | 753.0 | 758.9 | 5.9 | 115 | 1,319 | 498 |
| Quraishi et al. 2011 (123) | MQTL-3 | 57.10 | 44.80 | 69.40 | 24.60 | 223.5 | 712.3 | 488.7 | 3,524 | | |
| Griffiths et al. 2010 (124) | QTL_height_3B_1 | 75.70 | 46.21 | 77.81 | 31.60 | 429.7 | 725.3 | 295.6 | 2,711 | | |
| This study | MQTL3B-1 | 7.71 | 3.93 | 11.50 | 7.57 | 14.2 | 28.2 | 14.0 | 300 | | |
| This study | MQTL3B-2 | 26.43 | 20.48 | 32.39 | 11.91 | 36.0 | 58.6 | 22.7 | 391 | | |
| This study | MQTL3B-3 | 43.95 | 43.05 | 44.86 | 1.81 | 150.8 | 398.1 | 247.3 | 1,303 | | |
| This study | MQTL3B-4 | 50.84 | 49.18 | 52.50 | 3.32 | 535.9 | 576.3 | 40.4 | 376 | | |
| This study | MQTL3B-5 | 79.16 | 77.66 | 80.66 | 3.00 | 725.1 | 726.6 | 1.5 | 23 | 351 | 160 |


Supplementary figures

Figure S1: Distribution of the density of the CRWs (Cereba) and Quinta LTR

retrotransposons along the pseudomolecule of chromosome 3B. The percentage of the

elements was calculated in sliding windows (length 10 Mb, step 1 Mb). Cereba (blue), Quinta

subvariant A (green), Quinta subvariant B (red).
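The sliding-window calculation used for this and several later figures (10 Mb windows moved in 1 Mb steps) can be sketched as follows. This is a minimal Python illustration, assuming that elements are given as non-overlapping half-open intervals (start, end) in bp; the function name and example coordinates are hypothetical.

```python
def sliding_window_coverage(intervals, chrom_length,
                            window=10_000_000, step=1_000_000):
    """Percentage of each window covered by the given elements.

    intervals    : iterable of (start, end) pairs in bp, half-open and
                   assumed not to overlap each other
    chrom_length : length of the sequence in bp
    Returns a list of (window_start, percent_covered) tuples.
    """
    results = []
    start = 0
    while start + window <= chrom_length:
        end = start + window
        # Sum the part of every element that falls inside [start, end)
        covered = sum(max(0, min(e, end) - max(s, start)) for s, e in intervals)
        results.append((start, 100.0 * covered / window))
        start += step
    return results


# Toy example with a 20 bp "chromosome", 10 bp windows and 5 bp steps
print(sliding_window_coverage([(0, 5)], 20, window=10, step=5))
```

Because consecutive windows overlap by 9 Mb with the settings quoted in the caption, the resulting curve is smoothed and local peaks appear broader than the underlying features.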


Figure S2: Distribution of the recombination and the linkage disequilibrium along the

chromosome 3B pseudomolecule. Recombination rate (cM/Mb) was calculated using a sliding

window of 1 Mb and is represented in red; linkage disequilibrium is represented in blue. The left axis represents the recombination rate, while the right axis represents the LD values.



Figure S3: Distribution of the percentage of genes expressed in the different numbers of

experimental conditions (15 conditions tested). The expression of each of the 7,264 predicted

genes was analyzed and each gene was categorized as not expressed (0 conditions) or

expressed in 1 to 15 conditions.


Figure S4: Distribution and segmentation analysis of the expression breadth in the

barley and maize genomes.


A. Distribution of the average expression breadth along the 7 barley chromosomes (8

conditions from Mayer et al. (34)). B. Distribution of the average expression breadth along the

10 maize chromosomes (18 conditions described in Sekhon et al. (35)). The average

expression breadth was calculated in a sliding window of 10 Mb using a 1 Mb step.


Figure S5: Estimation of the insertion time (in million years) for the complete LTR-RTs annotated on the chromosome 3B sequence. A) Distribution of the insertion dates estimated for the complete set (21,619 elements). B) Examples of individual insertion-time profiles for 4 families displaying contrasting patterns. The date of the maximal density of amplification is indicated by the red number and line.


Figure S6: Distribution of the number of TEs from the three major superfamilies along the

3B pseudomolecule. Window size: 10 Mb; Step: 1 Mb. The gray highlighted areas represent

the R1 and R3 regions, whereas the hatched area represents the centromeric-pericentromeric

region. Gypsy (blue), Copia (green), CACTA (red)


Figure S7: TE distribution around coding sequences for the 10 most representative families

of transposable elements. Left graph: Gypsy (blue), Copia (green), CACTA (red), unclassified

LTR retrotransposons (brown). Right graph: Unclassified TEs (light green), LINE (light blue),

Harbinger (magenta), unclassified DNA transposons with TIRs (orange), Mutator (purple),

Mariner (cyan). The gene is considered as a single point at position 0 and the average density

of TE was estimated in a window of 20 kb upstream and downstream of each of the gene

models.


Figure S8: Pairwise collinearity analyses between the genes located on orthologous

chromosomes in four grass species: wheat chromosome 3B (Ta3B), rice chromosome 1 (Os1),

Brachypodium chromosome 2 (Bd2), and sorghum chromosome 3 (Sb3). Each line between

chromosomes represents a block of collinearity comprised of at least 5 genes. The number of

collinear blocks followed by the total number of collinear genes is indicated within each

pairwise comparative circle.

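Values shown within the comparative circles (number of collinear blocks; number of collinear genes): 7; 2489, 12; 2304, 10; 2372, 117; 1532, 128; 1755, 124; 1603.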


Figure S9: Expression breadth of syntenic, nonsyntenic, intra- and inter-chromosomally

duplicated genes. Graphs represent the proportions of syntenic genes (black line), nonsyntenic

genes (red line), intra-chromosomal duplicates (blue line) and single copy genes (green line)

expressed in 1 to 15 different experimental conditions tested.


Figure S10: Distribution among the 6 wheat chromosome groups of putative ancestral

loci for 152 nonsyntenic genes of chromosome 3B for which a parental copy was

determined unambiguously. The expected proportion of genes was found using all the

genes from the annotation of the IWGSC chromosome survey contigs (19) except group 3.

The Chi squared equals 8.606 with 5 degrees of freedom (p= 0.1258), indicating there is

no significant difference from the expected distribution.
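The p-value quoted in this caption can be checked from the chi-squared statistic alone. The Python sketch below uses the closed-form upper-tail probability of the chi-squared distribution for an odd number of degrees of freedom; the function name is hypothetical, and the observed and expected counts themselves are not reproduced here.

```python
import math


def chi2_sf_odd(x, df):
    """Upper-tail probability P(X >= x) of a chi-squared variable with odd df.

    Uses the finite series valid for odd degrees of freedom:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_j x^j / (1*3*...*(2j+1)),
    with j running from 0 to (df - 3) / 2.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only applies to odd df >= 1")
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    tail = math.erfc(math.sqrt(x / 2.0))
    term, series = 1.0, 0.0
    for j in range((df - 1) // 2):
        series += term
        term *= x / (2 * j + 3)
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * series


# Statistic and degrees of freedom quoted in the caption
print(round(chi2_sf_odd(8.606, 5), 4))
```

With six chromosome groups compared against fixed expected proportions, the test has 6 − 1 = 5 degrees of freedom, which is why the odd-df form applies.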


Figure S11: Relative abundance of TE superfamilies associated with syntenic and nonsyntenic genes. For each of the major TE superfamilies (according to the 3-letter code defined in Wicker et al. (125)), the enrichment in TEs found in the 20 kb upstream and downstream of the nonsyntenic genes was calculated relative to the average proportion observed around syntenic genes. Positive values indicate overrepresentation of TEs around nonsyntenic genes compared to syntenic genes, and negative values indicate underrepresentation. Only superfamilies representing more than 0.1% of the 3B sequence are shown in this histogram. Enrichment proportions (in %) are indicated at the top of the histogram.


References

1. J. Dubcovsky, J. Dvorak, Genome plasticity a key factor in the success of polyploid wheat under domestication. Science 316, 1862–1866 (2007). Medline doi:10.1126/science.1143986

2. F. Choulet, T. Wicker, C. Rustenholz, E. Paux, J. Salse, P. Leroy, S. Schlub, M. C. Le Paslier, G. Magdelenat, C. Gonthier, A. Couloux, H. Budak, J. Breen, M. Pumphrey, S. Liu, X. Kong, J. Jia, M. Gut, D. Brunel, J. A. Anderson, B. S. Gill, R. Appels, B. Keller, C. Feuillet, Megabase level sequencing reveals contrasted organization and evolution patterns of the wheat gene and transposable element spaces. Plant Cell 22, 1686–1701 (2010). Medline doi:10.1105/tpc.110.074187

3. J. Zhang, Evolution by gene duplication: An update. Trends Ecol. Evol. 18, 292–298 (2003). doi:10.1016/S0169-5347(03)00033-8

4. C. Rustenholz, F. Choulet, C. Laugier, J. Safár, H. Simková, J. Dolezel, F. Magni, S. Scalabrin, F. Cattonaro, S. Vautrin, A. Bellec, H. Bergès, C. Feuillet, E. Paux, A 3,000-loci transcription map of chromosome 3B unravels the structural and functional features of gene islands in hexaploid wheat. Plant Physiol. 157, 1596–1608 (2011). Medline doi:10.1104/pp.111.183921

5. I. D. Wilson, G. L. Barker, R. W. Beswick, S. K. Shepherd, C. Lu, J. A. Coghill, D. Edwards, P. Owen, R. Lyons, J. S. Parker, J. R. Lenton, M. J. Holdsworth, P. R. Shewry, K. J. Edwards, A transcriptomics resource for wheat functional genomics. Plant Biotechnol. J. 2, 495–506 (2004). Medline doi:10.1111/j.1467-7652.2004.00096.x

6. P. R. Bhat, A. Lukaszewski, X. Cui, J. Xu, J. T. Svensson, S. Wanamaker, J. G. Waines, T. J. Close, Mapping translocation breakpoints using a wheat microarray. Nucleic Acids Res. 35, 2936–2943 (2007). Medline doi:10.1093/nar/gkm148

7. L. L. Qi, B. Echalier, S. Chao, G. R. Lazo, G. E. Butler, O. D. Anderson, E. D. Akhunov, J. Dvorák, A. M. Linkiewicz, A. Ratnasiri, J. Dubcovsky, C. E. Bermudez-Kandianis, R. A. Greene, R. Kantety, C. M. La Rota, J. D. Munkvold, S. F. Sorrells, M. E. Sorrells, M. Dilbirligi, D. Sidhu, M. Erayman, H. S. Randhawa, D. Sandhu, S. N. Bondareva, K. S. Gill, A. A. Mahmoud, X. F. Ma, Miftahudin, J. P. Gustafson, E. J. Conley, V. Nduati, J. L. Gonzalez-Hernandez, J. A. Anderson, J. H. Peng, N. L. Lapitan, K. G. Hossain, V. Kalavacharla, S. F. Kianian, M. S. Pathan, D. S. Zhang, H. T. Nguyen, D. W. Choi, R. D. Fenton, T. J. Close, P. E. McGuire, C. O. Qualset, B. S. Gill, A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168, 701–712 (2004). Medline doi:10.1534/genetics.104.034868

8. R. Brenchley, M. Spannagl, M. Pfeifer, G. L. Barker, R. D’Amore, A. M. Allen, N. McKenzie, M. Kramer, A. Kerhornou, D. Bolser, S. Kay, D. Waite, M. Trick, I. Bancroft, Y. Gu, N. Huo, M. C. Luo, S. Sehgal, B. Gill, S. Kianian, O. Anderson, P. Kersey, J. Dvorak, W. R. McCombie, A. Hall, K. F. Mayer, K. J. Edwards, M. W. Bevan, N. Hall, Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491, 705–710 (2012). Medline doi:10.1038/nature11650

9. J. Jia, S. Zhao, X. Kong, Y. Li, G. Zhao, W. He, R. Appels, M. Pfeifer, Y. Tao, X. Zhang, R. Jing, C. Zhang, Y. Ma, L. Gao, C. Gao, M. Spannagl, K. F. Mayer, D. Li, S. Pan, F. Zheng, Q. Hu, X. Xia, J. Li, Q. Liang, J. Chen, T. Wicker, C. Gou, H. Kuang, G. He, Y. Luo, B. Keller, Q. Xia, P. Lu, J. Wang, H. Zou, R. Zhang, J. Xu, J. Gao, C. Middleton, Z. Quan, G. Liu, J. Wang, H. Yang, X. Liu, Z. He, L. Mao, J. Wang; International Wheat Genome Sequencing Consortium, Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature 496, 91–95 (2013). Medline doi:10.1038/nature12028

10. H. Q. Ling, S. Zhao, D. Liu, J. Wang, H. Sun, C. Zhang, H. Fan, D. Li, L. Dong, Y. Tao, C. Gao, H. Wu, Y. Li, Y. Cui, X. Guo, S. Zheng, B. Wang, K. Yu, Q. Liang, W. Yang, X. Lou, J. Chen, M. Feng, J. Jian, X. Zhang, G. Luo, Y. Jiang, J. Liu, Z. Wang, Y. Sha, B. Zhang, H. Wu, D. Tang, Q. Shen, P. Xue, S. Zou, X. Wang, X. Liu, F. Wang, Y. Yang, X. An, Z. Dong, K. Zhang, X. Zhang, M. C. Luo, J. Dvorak, Y. Tong, J. Wang, H. Yang, Z. Li, D. Wang, A. Zhang, J. Wang, Draft genome of the wheat A-genome progenitor Triticum urartu. Nature 496, 87–90 (2013). Medline doi:10.1038/nature11997

11. M. C. Schatz, A. L. Delcher, S. L. Salzberg, Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010). Medline doi:10.1101/gr.101360.109

12. V. Marx, Next-generation sequencing: The genome jigsaw. Nature 501, 263–268 (2013). Medline doi:10.1038/501261a

13. P. S. Schnable, D. Ware, R. S. Fulton, J. C. Stein, F. Wei, S. Pasternak, C. Liang, J. Zhang, L. Fulton, T. A. Graves, P. Minx, A. D. Reily, L. Courtney, S. S. Kruchowski, C. Tomlinson, C. Strong, K. Delehaunty, C. Fronick, B. Courtney, S. M. Rock, E. Belter, F. Du, K. Kim, R. M. Abbott, M. Cotton, A. Levy, P. Marchetto, K. Ochoa, S. M. Jackson, B. Gillam, W. Chen, L. Yan, J. Higginbotham, M. Cardenas, J. Waligorski, E. Applebaum, L. Phelps, J. Falcone, K. Kanchi, T. Thane, A. Scimone, N. Thane, J. Henke, T. Wang, J. Ruppert, N. Shah, K. Rotter, J. Hodges, E. Ingenthron, M. Cordes, S. Kohlberg, J. Sgro, B. Delgado, K. Mead, A. Chinwalla, S. Leonard, K. Crouse, K. Collura, D. Kudrna, J. Currie, R. He, A. Angelova, S. Rajasekar, T. Mueller, R. Lomeli, G. Scara, A. Ko, K. Delaney, M. Wissotski, G. Lopez, D. Campos, M. Braidotti, E. Ashley, W. Golser, H. Kim, S. Lee, J. Lin, Z. Dujmic, W. Kim, J. Talag, A. Zuccolo, C. Fan, A. Sebastian, M. Kramer, L. Spiegel, L. Nascimento, T. Zutavern, B. Miller, C. Ambroise, S. Muller, W. Spooner, A. Narechania, L. Ren, S. Wei, S. Kumari, B. Faga, M. J. Levy, L. McMahan, P. Van Buren, M. W. Vaughn, K. Ying, C. T. Yeh, S. J. Emrich, Y. Jia, A. Kalyanaraman, A. P. Hsia, W. B. Barbazuk, R. S. Baucom, T. P. Brutnell, N. C. Carpita, C. Chaparro, J. M. Chia, J. M. Deragon, J. C. Estill, Y. Fu, J. A. Jeddeloh, Y. Han, H. Lee, P. Li, D. R. Lisch, S. Liu, Z. Liu, D. H. Nagel, M. C. McCann, P. SanMiguel, A. M. Myers, D. Nettleton, J. Nguyen, B. W. Penning, L. Ponnala, K. L. Schneider, D. C. Schwartz, A. Sharma, C. Soderlund, N. M. Springer, Q. Sun, H. Wang, M. Waterman, R. Westerman, T. K. Wolfgruber, L. Yang, Y. Yu, L. Zhang, S. Zhou, Q. Zhu, J. L. Bennetzen, R. K. Dawe, J. Jiang, N. Jiang, G. G. Presting, S. R. Wessler, S. Aluru, R. A. Martienssen, S. W. Clifton, W. R. McCombie, R. A. Wing, R. K. Wilson, The B73 maize genome: Complexity, diversity, and dynamics. Science 326, 1112–1115 (2009). Medline doi:10.1126/science.1178534

14. X. Xu, S. Pan, S. Cheng, B. Zhang, D. Mu, P. Ni, G. Zhang, S. Yang, R. Li, J. Wang, G. Orjeda, F. Guzman, M. Torres, R. Lozano, O. Ponce, D. Martinez, G. De la Cruz, S. K. Chakrabarti, V. U. Patil, K. G. Skryabin, B. B. Kuznetsov, N. V. Ravin, T. V. Kolganova, A. V. Beletsky, A. V. Mardanov, A. Di Genova, D. M. Bolser, D. M. Martin, G. Li, Y. Yang, H. Kuang, Q. Hu, X. Xiong, G. J. Bishop, B. Sagredo, N. Mejía, W. Zagorski, R. Gromadka, J. Gawor, P. Szczesny, S. Huang, Z. Zhang, C. Liang, J. He, Y. Li, Y. He, J. Xu, Y. Zhang, B. Xie, Y. Du, D. Qu, M. Bonierbale, M. Ghislain, M. R. Herrera, G. Giuliano, M. Pietrella, G. Perrotta, P. Facella, K. O’Brien, S. E. Feingold, L. E. Barreiro, G. A. Massa, L. Diambra, B. R. Whitty, B. Vaillancourt, H. Lin, A. N. Massa, M. Geoffroy, S. Lundback, D. DellaPenna, C. R. Buell, S. K. Sharma, D. F. Marshall, R. Waugh, G. J. Bryan, M. Destefanis, I. Nagy, D. Milbourne, S. J. Thomson, M. Fiers, J. M. Jacobs, K. L. Nielsen, M. Sønderkær, M. Iovene, G. A. Torres, J. Jiang, R. E. Veilleux, C. W. Bachem, J. de Boer, T. Borm, B. Kloosterman, H. van Eck, E. Datema, B. Hekkert, A. Goverse, R. C. van Ham, R. G. Visser; Potato Genome Sequencing Consortium, Genome sequence and analysis of the tuber crop potato. Nature 475, 189–195 (2011). Medline doi:10.1038/nature10158

15. J. Doležel, M. Kubaláková, E. Paux, J. Bartos, C. Feuillet, Chromosome-based genomics in the cereals. Chromosome Res. 15, 51–66 (2007). Medline doi:10.1007/s10577-006-1106-x

16. J. Safár, J. Bartos, J. Janda, A. Bellec, M. Kubaláková, M. Valárik, S. Pateyron, J. Weiserová, R. Tusková, J. Cíhalíková, J. Vrána, H. Simková, P. Faivre-Rampant, P. Sourdille, M. Caboche, M. Bernard, J. Dolezel, B. Chalhoub, Dissecting large and complex genomes: Flow sorting and BAC cloning of individual chromosomes from bread wheat. Plant J. 39, 960–968 (2004). Medline doi:10.1111/j.1365-313X.2004.02179.x

17. E. Paux, P. Sourdille, J. Salse, C. Saintenac, F. Choulet, P. Leroy, A. Korol, M. Michalak, S. Kianian, W. Spielmeyer, E. Lagudah, D. Somers, A. Kilian, M. Alaux, S. Vautrin, H. Bergès, K. Eversole, R. Appels, J. Safar, H. Simkova, J. Dolezel, M. Bernard, C. Feuillet, A physical map of the 1-gigabase bread wheat chromosome 3B. Science 322, 101–104 (2008). Medline doi:10.1126/science.1161847

18. Supplementary materials are available on Science Online.

19. International Wheat Genome Sequencing Consortium, A chromosome-based draft sequence of the hexaploid (Triticum aestivum) bread wheat genome. Science 345, 1251788 (2014). doi:10.1126/science.1251788

20. B. S. Gill, B. Friebe, T. Endo, Standard karyotype and nomenclature system for description of chromosome bands and structural aberrations in wheat (Triticum aestivum). Genome 34, 830–839 (1991). doi:10.1139/g91-128

21. P. S. Chain, D. V. Grafham, R. S. Fulton, M. G. Fitzgerald, J. Hostetler, D. Muzny, J. Ali, B. Birren, D. C. Bruce, C. Buhay, J. R. Cole, Y. Ding, S. Dugan, D. Field, G. M. Garrity, R. Gibbs, T. Graves, C. S. Han, S. H. Harrison, S. Highlander, P. Hugenholtz, H. M. Khouri, C. D. Kodira, E. Kolker, N. C. Kyrpides, D. Lang, A. Lapidus, S. A. Malfatti, V. Markowitz, T. Metha, K. E. Nelson, J. Parkhill, S. Pitluck, X. Qin, T. D. Read, J. Schmutz, S. Sozhamannan, P. Sterk, R. L. Strausberg, G. Sutton, N. R. Thomson, J. M. Tiedje, G. Weinstock, A. Wollam, J. C. Detter; Genomic Standards Consortium Human Microbiome Project Jumpstart Consortium, Genomics. Genome project standards in a new era of sequencing. Science 326, 236–237 (2009). Medline doi:10.1126/science.1180614

22. B. Li, F. Choulet, Y. Heng, W. Hao, E. Paux, Z. Liu, W. Yue, W. Jin, C. Feuillet, X. Zhang, Wheat centromeric retrotransposons: The new ones take a major role in centromeric structure. Plant J. 73, 952–965 (2013). Medline doi:10.1111/tpj.12086

23. H. Yan, P. B. Talbert, H. R. Lee, J. Jett, S. Henikoff, F. Chen, J. Jiang, Intergenic locations of rice centromeric chromatin. PLOS Biol. 6, e286 (2008). Medline doi:10.1371/journal.pbio.0060286

24. H. Zhang, R. K. Dawe, Total centromere size and genome size are strongly correlated in ten grass species. Chromosome Res. 20, 403–412 (2012). Medline doi:10.1007/s10577-012-9284-1

25. T. Sakuno, K. Tada, Y. Watanabe, Kinetochore geometry defined by cohesion within the centromere. Nature 458, 852–858 (2009). Medline doi:10.1038/nature07876

26. Q. Pan, F. Ali, X. Yang, J. Li, J. Yan, Exploring the genetic characteristics of two recombinant inbred line populations via high-density SNP markers in maize. PLOS ONE 7, e52777 (2012). Medline doi:10.1371/journal.pone.0052777

27. A. J. Lukaszewski, C. A. Curtis, Physical distribution of recombination in B-genome chromosomes of tetraploid wheat. Theor. Appl. Genet. 86, 121–127 (1993). Medline doi:10.1007/BF00223816

28. C. Saintenac, M. Falque, O. C. Martin, E. Paux, C. Feuillet, P. Sourdille, Detailed recombination studies along chromosome 3B provide new insights on crossover distribution in wheat (Triticum aestivum L.). Genetics 181, 393–403 (2009). Medline doi:10.1534/genetics.108.097469

29. J. Evans, R. F. McCormick, D. Morishige, S. N. Olson, B. Weers, J. Hilley, P. Klein, W. Rooney, J. Mullet, Extensive variation in the density and distribution of DNA polymorphism in sorghum genomes. PLOS ONE 8, e79192 (2013). Medline doi:10.1371/journal.pone.0079192

30. International Rice Genome Sequencing Project, The map-based sequence of the rice genome. Nature 436, 793–800 (2005). Medline doi:10.1038/nature03895

31. A. H. Paterson, J. E. Bowers, R. Bruggmann, I. Dubchak, J. Grimwood, H. Gundlach, G. Haberer, U. Hellsten, T. Mitros, A. Poliakov, J. Schmutz, M. Spannagl, H. Tang, X. Wang, T. Wicker, A. K. Bharti, J. Chapman, F. A. Feltus, U. Gowik, I. V. Grigoriev, E. Lyons, C. A. Maher, M. Martis, A. Narechania, R. P. Otillar, B. W. Penning, A. A. Salamov, Y. Wang, L. Zhang, N. C. Carpita, M. Freeling, A. R. Gingle, C. T. Hash, B. Keller, P. Klein, S. Kresovich, M. C. McCann, R. Ming, D. G. Peterson, Mehboob-ur-Rahman, D. Ware, P. Westhoff, K. F. Mayer, J. Messing, D. S. Rokhsar, The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009). Medline doi:10.1038/nature07723

32. A. Gottlieb, H. G. Müller, A. N. Massa, H. Wanjugi, K. R. Deal, F. M. You, X. Xu, Y. Q. Gu, M. C. Luo, O. D. Anderson, A. P. Chan, P. Rabinowicz, K. M. Devos, J. Dvorak, Insular organization of gene space in grass genomes. PLOS ONE 8, e54101 (2013). Medline doi:10.1371/journal.pone.0054101

33. M. W. Ganal, G. Durstewitz, A. Polley, A. Bérard, E. S. Buckler, A. Charcosset, J. D. Clarke, E. M. Graner, M. Hansen, J. Joets, M. C. Le Paslier, M. D. McMullen, P. Montalent, M. Rose, C. C. Schön, Q. Sun, H. Walter, O. C. Martin, M. Falque, A large maize (Zea mays L.) SNP genotyping array: Development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLOS ONE 6, e28334 (2011). Medline doi:10.1371/journal.pone.0028334

34. K. F. Mayer, R. Waugh, J. W. Brown, A. Schulman, P. Langridge, M. Platzer, G. B. Fincher, G. J. Muehlbauer, K. Sato, T. J. Close, R. P. Wise, N. Stein; International Barley Genome Sequencing Consortium, A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711–716 (2012). Medline

35. R. S. Sekhon, R. Briskine, C. N. Hirsch, C. L. Myers, N. M. Springer, C. R. Buell, N. de Leon, S. M. Kaeppler, Maize gene atlas developed by RNA sequencing and comparative evaluation of transcriptomes based on RNA sequencing and microarrays. PLOS ONE 8, e61005 (2013). Medline doi:10.1371/journal.pone.0061005

36. R. S. Baucom, J. C. Estill, C. Chaparro, N. Upshaw, A. Jogi, J. M. Deragon, R. P. Westerman, P. J. Sanmiguel, J. L. Bennetzen, Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLOS Genet. 5, e1000732 (2009). Medline doi:10.1371/journal.pgen.1000732

37. R. S. Baucom, J. C. Estill, J. Leebens-Mack, J. L. Bennetzen, Natural selection on gene function drives the evolution of LTR retrotransposon families in the rice genome. Genome Res. 19, 243–254 (2009). Medline doi:10.1101/gr.083360.108

38. M. Charles, H. Belcram, J. Just, C. Huneau, A. Viollet, A. Couloux, B. Segurens, M. Carter, V. Huteau, O. Coriton, R. Appels, S. Samain, B. Chalhoub, Dynamics and differential proliferation of transposable elements during the evolution of the B and A genomes of wheat. Genetics 180, 1071–1086 (2008). Medline doi:10.1534/genetics.108.092304

39. E. M. Sergeeva, E. A. Salina, I. G. Adonina, B. Chalhoub, Evolutionary analysis of the CACTA DNA-transposon Caspar across wheat species using sequence comparison and in situ hybridization. Mol. Genet. Genomics 284, 11–23 (2010). Medline doi:10.1007/s00438-010-0544-5

40. C. Lu, J. Chen, Y. Zhang, Q. Hu, W. Su, H. Kuang, Miniature inverted-repeat transposable elements (MITEs) have been accumulated through amplification bursts and play important roles in gene expression and species diversity in Oryza sativa. Mol. Biol. Evol. 29, 1005–1017 (2012). Medline doi:10.1093/molbev/msr282

41. M. D. Gale, K. M. Devos, Comparative genetics in the grasses. Proc. Natl. Acad. Sci. U.S.A. 95, 1971–1974 (1998). Medline doi:10.1073/pnas.95.5.1971

42. K. M. Devos, M. D. Gale, Genome relationships: The grass model in current research. Plant Cell 12, 637–646 (2000). Medline doi:10.1105/tpc.12.5.637

43. F. Murat, J. H. Xu, E. Tannier, M. Abrouk, N. Guilhot, C. Pont, J. Messing, J. Salse, Ancestral grass karyotype reconstruction unravels new mechanisms of genome shuffling as a source of plant evolution. Genome Res. 20, 1545–1557 (2010). Medline doi:10.1101/gr.109744.110

44. International Brachypodium Initiative, Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463, 763–768 (2010). Medline doi:10.1038/nature08747

45. E. D. Akhunov, A. R. Akhunova, A. M. Linkiewicz, J. Dubcovsky, D. Hummel, G. R. Lazo, S. Chao, O. D. Anderson, J. David, L. Qi, B. Echalier, B. S. Gill, Miftahudin, J. P. Gustafson, M. La Rota, M. E. Sorrells, D. Zhang, H. T. Nguyen, V. Kalavacharla, K. Hossain, S. F. Kianian, J. Peng, N. L. Lapitan, E. J. Wennerlind, V. Nduati, J. A. Anderson, D. Sidhu, K. S. Gill, P. E. McGuire, C. O. Qualset, J. Dvorak, Synteny perturbations between wheat homoeologous chromosomes caused by locus duplications and deletions correlate with recombination rates. Proc. Natl. Acad. Sci. U.S.A. 100, 10836–10841 (2003). Medline doi:10.1073/pnas.1934431100

46. T. Wicker, K. F. Mayer, H. Gundlach, M. Martis, B. Steuernagel, U. Scholz, H. Simková, M. Kubaláková, F. Choulet, S. Taudien, M. Platzer, C. Feuillet, T. Fahima, H. Budak, J. Dolezel, B. Keller, N. Stein, Frequent gene movement and pseudogene evolution is common to the large and complex genomes of wheat, barley, and their relatives. Plant Cell 23, 1706–1718 (2011). Medline doi:10.1105/tpc.111.086629

47. T. Wicker, J. P. Buchmann, B. Keller, Patching gaps in plant genomes results in gene movement and erosion of colinearity. Genome Res. 20, 1229–1237 (2010). Medline doi:10.1101/gr.107284.110

48. M. Morgante, S. Brunner, G. Pea, K. Fengler, A. Zuccolo, A. Rafalski, Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat. Genet. 37, 997–1002 (2005). Medline doi:10.1038/ng1615

49. C. Feuillet, J. E. Leach, J. Rogers, P. S. Schnable, K. Eversole, Crop genome sequencing: Lessons and rationales. Trends Plant Sci. 16, 77–88 (2011). Medline doi:10.1016/j.tplants.2010.10.005

50. E. Paux, S. Faure, F. Choulet, D. Roger, V. Gauthier, J. P. Martinant, P. Sourdille, F. Balfourier, M. C. Le Paslier, A. Chauveau, M. Cakir, B. Gandon, C. Feuillet, Insertion site-based polymorphism markers open new perspectives for genome saturation and marker-assisted selection in wheat. Plant Biotechnol. J. 8, 196–210 (2010). Medline doi:10.1111/j.1467-7652.2009.00477.x

51. B. Goffinet, S. Gerber, Quantitative trait loci: A meta-analysis. Genetics 155, 463–473 (2000). Medline

52. J. A. Foley, N. Ramankutty, K. A. Brauman, E. S. Cassidy, J. S. Gerber, M. Johnston, N. D. Mueller, C. O’Connell, D. K. Ray, P. C. West, C. Balzer, E. M. Bennett, S. R. Carpenter, J. Hill, C. Monfreda, S. Polasky, J. Rockström, J. Sheehan, S. Siebert, D. Tilman, D. P. Zaks, Solutions for a cultivated planet. Nature 478, 337–342 (2011). Medline doi:10.1038/nature10452

53. P. Leroy, N. Guilhot, H. Sakai, A. Bernard, F. Choulet, S. Theil, S. Reboux, N. Amano, T. Flutre, C. Pelegrin, H. Ohyanagi, M. Seidel, F. Giacomoni, M. Reichstadt, M. Alaux, E. Gicquello, F. Legeai, L. Cerutti, H. Numa, T. Tanaka, K. Mayer, T. Itoh, H. Quesneville, C. Feuillet, TriAnnot: A versatile and high performance pipeline for the automated annotation of plant genomes. Front. Plant Sci. 3, 5 (2012). Medline doi:10.3389/fpls.2012.00005

54. J. Chen, A. K. Gupta, Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance (Birkhäuser, Basel, 2012).

55. L. Li, C. J. Stoeckert Jr., D. S. Roos, OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003). Medline doi:10.1101/gr.1224503

56. J. D. Thompson, D. G. Higgins, T. J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994). Medline doi:10.1093/nar/22.22.4673

57. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 13, 555–556 (1997). Medline

58. H. Šimková, J. T. Svensson, P. Condamine, E. Hribová, P. Suchánková, P. R. Bhat, J. Bartos, J. Safár, T. J. Close, J. Dolezel, Coupling amplified DNA from flow-sorted chromosomes to high-density SNP mapping in barley. BMC Genomics 9, 294 (2008). Medline doi:10.1186/1471-2164-9-294

59. R. Li, C. Yu, Y. Li, T. W. Lam, S. M. Yiu, K. Kristiansen, J. Wang, SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009). Medline doi:10.1093/bioinformatics/btp336

60. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997). Medline doi:10.1093/nar/25.17.3389

61. J. M. Aury, C. Cruaud, V. Barbe, O. Rogier, S. Mangenot, G. Samson, J. Poulain, V. Anthouard, C. Scarpelli, F. Artiguenave, P. Wincker, High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies. BMC Genomics 9, 603 (2008). Medline doi:10.1186/1471-2164-9-603

62. R. Philippe, F. Choulet, E. Paux, J. van Oeveren, J. Tang, A. H. Wittenberg, A. Janssen, M. J. van Eijk, K. Stormo, A. Alberti, P. Wincker, E. Akhunov, E. van der Vossen, C. Feuillet, Whole Genome Profiling provides a robust framework for physical mapping and sequencing in the highly complex and repetitive wheat genome. BMC Genomics 13, 47 (2012). Medline doi:10.1186/1471-2164-13-47

63. W. J. Kent, BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002). Medline doi:10.1101/gr.229202

64. C. Soderlund, S. Humphray, A. Dunham, L. French, Contigs built with fingerprints, markers, and FPC V4.7. Genome Res. 10, 1772–1787 (2000). Medline doi:10.1101/gr.GR-1375R

65. A. Graner, H. Siedler, A. Jahoor, R. G. Herrmann, G. Wenzel, Assessment of the degree and the type of restriction fragment length polymorphism in barley (Hordeum vulgare). Theor. Appl. Genet. 80, 826–832 (1990). Medline doi:10.1007/BF00224200

66. M. Stanke, O. Keller, I. Gunduz, A. Hayes, S. Waack, B. Morgenstern, AUGUSTUS: Ab initio prediction of alternative transcripts. Nucleic Acids Res. 34 (Web Server), W435–W439 (2006). Medline doi:10.1093/nar/gkl200

67. E. Blanco, J. F. Abril, Computational gene annotation in new genome assemblies using GeneID. Methods Mol. Biol. 537, 243–261 (2009). Medline doi:10.1007/978-1-59745-251-9_12

68. K. Rutherford, J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Rajandream, B. Barrell, Artemis: Sequence visualization and annotation. Bioinformatics 16, 944–945 (2000). Medline doi:10.1093/bioinformatics/16.10.944

69. T. Schiex, A. Moisan, P. Rouze, in Computational Biology, O. Gascuel, M.-F. Sagot, Eds., LNCS 2066 (Springer-Verlag, Berlin Heidelberg, 2001), pp. 111–125.

70. G. S. Slater, E. Birney, Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). Medline doi:10.1186/1471-2105-6-31

71. M. Van Bel, S. Proost, E. Wischnitzki, S. Movahedi, C. Scheerlinck, Y. Van de Peer, K. Vandepoele, Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158, 590–600 (2012). Medline doi:10.1104/pp.111.189514

72. W. Lin, Y. Chen, J. Ho, C. Hsiao, GOBU: Toward an integration interface for biological object. J. Information Sci. Eng. 22, 19–30 (2006); http://www.iis.sinica.edu.tw/papers/hoho/5342-F.pdf

73. S. G. Jantzen, B. J. Sutherland, D. R. Minkley, B. F. Koop, GO trimming: Systematically reducing redundancy in large Gene Ontology datasets. BMC Res. Notes 4, 267 (2011). Medline doi:10.1186/1756-0500-4-267

74. G. van Ooijen, G. Mayr, M. M. Kasiem, M. Albrecht, B. J. Cornelissen, F. L. Takken, Structure-function analysis of the NB-ARC domain of plant disease resistance proteins. J. Exp. Bot. 59, 1383–1397 (2008). Medline doi:10.1093/jxb/ern045

75. K. Lagesen, P. Hallin, E. A. Rødland, H. H. Staerfeldt, T. Rognes, D. W. Ussery, RNAmmer: Consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007). Medline doi:10.1093/nar/gkm160

76. T. M. Lowe, S. R. Eddy, tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997). Medline doi:10.1093/nar/25.5.0955

77. S. Connelly, C. Marshallsay, D. Leader, J. W. Brown, W. Filipowicz, Small nuclear RNA genes transcribed by either RNA polymerase II or RNA polymerase III in monocot plants share three promoter elements and use a strategy to regulate gene expression different from that used by their dicot plant counterparts. Mol. Cell. Biol. 14, 5910–5919 (1994). Medline doi:10.1128/MCB.14.9.5910

78. A. J. Enright, S. Van Dongen, C. A. Ouzounis, An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002). Medline doi:10.1093/nar/30.7.1575

79. K. Katoh, K. Kuma, H. Toh, T. Miyata, MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511–518 (2005). Medline doi:10.1093/nar/gki198

80. T. Flutre, E. Duprat, C. Feuillet, H. Quesneville, Considering transposable element diversification in de novo annotation approaches. PLOS ONE 6, e16526 (2011). Medline doi:10.1371/journal.pone.0016526

81. D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, S. L. Salzberg, TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). Medline doi:10.1186/gb-2013-14-4-r36

82. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012). Medline doi:10.1038/nmeth.1923

83. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin; 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009). Medline doi:10.1093/bioinformatics/btp352

84. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). Medline doi:10.1038/nbt.1621

85. A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, B. Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008). Medline doi:10.1038/nmeth.1226

86. S. Shen, J. W. Park, J. Huang, K. A. Dittmar, Z. X. Lu, Q. Zhou, R. P. Carstens, Y. Xing, MATS: A Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Res. 40, e61 (2012). Medline doi:10.1093/nar/gkr1291

87. T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell, J. Parkhill, ACT: The Artemis Comparison Tool. Bioinformatics 21, 3422–3423 (2005). Medline doi:10.1093/bioinformatics/bti553

88. Y. Wang, H. Tang, J. D. Debarry, X. Tan, J. Li, X. Wang, T. H. Lee, H. Jin, B. Marler, H. Guo, J. C. Kissinger, A. H. Paterson, MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012). Medline doi:10.1093/nar/gkr1293

89. M. Krzywinski, J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, M. A. Marra, Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009). Medline doi:10.1101/gr.092759.109

90. E. R. Sears, The aneuploids of common wheat. Mo. Agric. Exp. Sta. Res. Bull. 572, 1–59 (1954).

91. E. R. Sears, L. Sears, in Proc. 5th Int. Wheat Genetics Symp., S. Ramanujam, Ed. (Indian Agricultural Research Institute, New Delhi, India, 1978), pp. 389–407.

92. T. R. Endo, B. S. Gill, The deletion stocks of common wheat. J. Hered. 87, 295–307 (1996). doi:10.1093/oxfordjournals.jhered.a023003

93. A. Graner, H. Siedler, A. Jahoor, R. G. Herrmann, G. Wenzel, Assessment of the degree and the type of restriction fragment length polymorphism in barley (Hordeum vulgare). Theor. Appl. Genet. 80, 826–832 (1990). Medline doi:10.1007/BF00224200

94. F. Balfourier, V. Roussel, P. Strelchenko, F. Exbrayat-Vinson, P. Sourdille, G. Boutet, J. Koenig, C. Ravel, O. Mitrofanova, M. Beckert, G. Charmet, A worldwide bread wheat core collection arrayed in a 384-well plate. Theor. Appl. Genet. 114, 1265–1275 (2007). Medline doi:10.1007/s00122-007-0517-1

95. A. Horvath, A. Didier, J. Koenig, F. Exbrayat, G. Charmet, F. Balfourier, Analysis of diversity and linkage disequilibrium along chromosome 3B of bread wheat (Triticum aestivum L.). Theor. Appl. Genet. 119, 1523–1537 (2009). Medline doi:10.1007/s00122-009-1153-8

96. S. de Givry, M. Bouchez, P. Chabrier, D. Milan, T. Schiex, CARHTA GENE: Multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics 21, 1703–1704 (2005). Medline doi:10.1093/bioinformatics/bti222

97. K. C. Cone, M. D. McMullen, I. V. Bi, G. L. Davis, Y. S. Yim, J. M. Gardiner, M. L. Polacco, H. Sanchez-Villeda, Z. Fang, S. G. Schroeder, S. A. Havermann, J. E. Bowers, A. H. Paterson, C. A. Soderlund, F. W. Engler, R. A. Wing, E. H. Coe Jr., Genetic, physical, and informatics resources for maize. On the road to an integrated map. Plant Physiol. 130, 1598–1605 (2002). Medline doi:10.1104/pp.012245

98. P. Wenzl, P. Suchánková, J. Carling, H. Simková, E. Huttner, M. Kubaláková, P. Sourdille, E. Paul, C. Feuillet, A. Kilian, J. Dolezel, Isolated chromosomes as a new and efficient source of DArT markers for the saturation of genetic maps. Theor. Appl. Genet. 121, 465–474 (2010). Medline doi:10.1007/s00122-010-1323-8

99. M. E. Sorrells, J. P. Gustafson, D. Somers, S. Chao, D. Benscher, G. Guedira-Brown, E. Huttner, A. Kilian, P. E. McGuire, K. Ross, J. Tanaka, P. Wenzl, K. Williams, C. O. Qualset, A. Van Deynze, Reconstruction of the synthetic W7984 x Opata M85 wheat reference population. Genome 54, 875–882 (2011). Medline doi:10.1139/g11-054

100. P. Sourdille, T. Cadalen, H. Guyomarc’h, J. W. Snape, M. R. Perretant, G. Charmet, C. Boeuf, S. Bernard, M. Bernard, An update of the Courtot x Chinese Spring intervarietal molecular marker linkage map for the QTL detection of agronomic traits in wheat. Theor. Appl. Genet. 106, 530–538 (2003). Medline

101. T. W. Banks, M. C. Jordan, D. J. Somers, Single-Feature Polymorphism Mapping in Bread Wheat. Plant Gen. 2, 167 (2009). doi:10.3835/plantgenome2009.02.0009

102. P. A. Wilkinson, M. O. Winfield, G. L. Barker, A. M. Allen, A. Burridge, J. A. Coghill, K. J. Edwards, CerealsDB 2.0: An integrated resource for plant breeders and scientists. BMC Bioinformatics 13, 219 (2012). Medline doi:10.1186/1471-2105-13-219

103. A. M. Allen, G. L. Barker, P. Wilkinson, A. Burridge, M. Winfield, J. Coghill, C. Uauy, S. Griffiths, P. Jack, S. Berry, P. Werner, J. P. Melichar, J. McDougall, R. Gwilliam, P. Robinson, K. J. Edwards, Discovery and development of exome-based, co-dominant single nucleotide polymorphism markers in hexaploid wheat (Triticum aestivum L.). Plant Biotechnol. J. 11, 279–295 (2013). Medline doi:10.1111/pbi.12009

104. M. O. Winfield, P. A. Wilkinson, A. M. Allen, G. L. Barker, J. A. Coghill, A. Burridge, A. Hall, R. C. Brenchley, R. D’Amore, N. Hall, M. W. Bevan, T. Richmond, D. J. Gerhardt, J. A. Jeddeloh, K. J. Edwards, Targeted re-sequencing of the allohexaploid wheat exome. Plant Biotechnol. J. 10, 733–742 (2012). Medline doi:10.1111/j.1467-7652.2012.00713.x

105. P. J. Bradbury, Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, E. S. Buckler, TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635 (2007). Medline doi:10.1093/bioinformatics/btm308

106. O. Sosnowski, A. Charcosset, J. Joets, BioMercator V3: An upgrade of genetic map compilation and quantitative trait loci meta-analysis algorithms. Bioinformatics 28, 2082–2083 (2012). Medline doi:10.1093/bioinformatics/bts313

107. J. C. Zadoks, T. T. Chang, C. F. Konzak, A decimal code for the growth stages of cereals. Weed Res. 14, 415–421 (1974). doi:10.1111/j.1365-3180.1974.tb01084.x

108. C. Groos, N. Robert, E. Bervas, G. Charmet, Genetic analysis of grain protein-content, grain yield and thousand-kernel weight in bread wheat. Theor. Appl. Genet. 106, 1032–1040 (2003). Medline

109. D. An, J. Su, Q. Liu, Y. Zhu, Y. Tong, J. Li, R. Jing, B. Li, Z. Li, Mapping QTLs for nitrogen uptake in relation to the early growth of wheat, Triticum aestivum L. Plant Soil 284, 73–84 (2006). doi:10.1007/s11104-006-0030-3

110. D. Z. Habash, S. Bernard, J. Schondelmaier, J. Weyen, S. A. Quarrie, The genetics of nitrogen use in hexaploid wheat: N utilisation, development and yield. Theor. Appl. Genet. 114, 403–419 (2007). Medline doi:10.1007/s00122-006-0429-5

111. A. Laperche, M. Brancourt-Hulmel, E. Heumez, O. Gardet, E. Hanocq, F. Devienne-Barret, J. Le Gouis, Using genotype x nitrogen interaction variables to evaluate the QTL involved in wheat tolerance to nitrogen constraints. Theor. Appl. Genet. 115, 399–415 (2007). Medline doi:10.1007/s00122-007-0575-4

112. Z. Li et al., Molecular mapping of QTLs for root response to phosphorus deficiency at seedling stage in wheat (Triticum aestivum L.). Prog. Nat. Sci. 17, 1177 (2007).

113. J.-X. Fontaine, C. Ravel, K. Pageau, E. Heumez, F. Dubois, B. Hirel, J. Le Gouis, A quantitative genetic study for elucidating the contribution of glutamine synthetase, glutamate dehydrogenase and other nitrogen-related physiological traits to the agronomic performance of common wheat. Theor. Appl. Genet. 119, 645–662 (2009). Medline doi:10.1007/s00122-009-1076-4

114. Y. Zhang, J. Tang, Y. Zhang, J. Yan, Y. Xiao, Y. Zhang, X. Xia, Z. He, QTL mapping for quantities of protein fractions in bread wheat (Triticum aestivum L.). Theor. Appl. Genet. 122, 971–987 (2011). Medline doi:10.1007/s00122-010-1503-6

115. D. Bennett, A. Izanloo, M. Reynolds, H. Kuchel, P. Langridge, T. Schnurbusch, Genetic dissection of grain yield and physical grain quality in bread wheat (Triticum aestivum L.) under water-limited environments. Theor. Appl. Genet. 125, 255–271 (2012). Medline doi:10.1007/s00122-012-1831-9

116. M. Bogard, M. Jourdan, V. Allard, P. Martre, M. R. Perretant, C. Ravel, E. Heumez, S. Orford, J. Snape, S. Griffiths, O. Gaju, J. Foulkes, J. Le Gouis, Anthesis date mainly explained correlations between post-anthesis leaf senescence, grain yield, and grain protein concentration in a winter wheat population segregating for flowering time QTLs. J. Exp. Bot. 62, 3621–3636 (2011). Medline doi:10.1093/jxb/err061

117. Y. Guo, F. M. Kong, Y. F. Xu, Y. Zhao, X. Liang, Y. Y. Wang, D. G. An, S. S. Li, QTL mapping for seedling traits in wheat grown under varying concentrations of N, P and K nutrients. Theor. Appl. Genet. 124, 851–865 (2012). Medline doi:10.1007/s00122-011-1749-7

118. M. Bogard, V. Allard, P. Martre, E. Heumez, J. W. Snape, S. Orford, S. Griffiths, O. Gaju, J. Foulkes, J. Le Gouis, Identifying wheat genomic regions for improving grain protein concentration independently of grain yield using multiple inter-related populations. Mol. Breed. 31, 587–599 (2013). doi:10.1007/s11032-012-9817-5

119. X. Liu, R. Li, X. Chang, R. Jing, Mapping QTLs for seedling root traits in a doubled haploid wheat population under different water regimes. Euphytica 189, 51–66 (2013). doi:10.1007/s10681-012-0690-4

120. S. Griffiths, J. Simmonds, M. Leverington, Y. Wang, L. Fish, L. Sayers, L. Alibert, S. Orford, L. Wingen, L. Herry, S. Faure, D. Laurie, L. Bilham, J. Snape, Meta-QTL analysis of the genetic control of ear emergence in elite European winter wheat germplasm. Theor. Appl. Genet. 119, 383–395 (2009). Medline doi:10.1007/s00122-009-1046-x

121. S.-L. Mao, Y.-M. Wei, W. Cao, X.-J. Lan, M. Yu, Z.-M. Chen, G.-Y. Chen, Y.-L. Zheng, Confirmation of the relationship between plant height and Fusarium head blight resistance in wheat (Triticum aestivum L.) by QTL meta-analysis. Euphytica 174, 343–356 (2010). doi:10.1007/s10681-010-0128-9

122. L.-Y. Zhang, D. C. Liu, X. L. Guo, W. L. Yang, J. Z. Sun, D. W. Wang, A. Zhang, Genomic distribution of quantitative trait loci for yield and yield-related traits in common wheat. J. Integr. Plant Biol. 52, 996–1007 (2010). Medline doi:10.1111/j.1744-7909.2010.00967.x

123. U. M. Quraishi, M. Abrouk, F. Murat, C. Pont, S. Foucrier, G. Desmaizieres, C. Confolent, N. Rivière, G. Charmet, E. Paux, A. Murigneux, L. Guerreiro, S. Lafarge, J. Le Gouis, C. Feuillet, J. Salse, Cross-genome map based dissection of a nitrogen use efficiency ortho-metaQTL in bread wheat unravels concerted cereal genome evolution. Plant J. 65, 745–756 (2011). Medline doi:10.1111/j.1365-313X.2010.04461.x

124. S. Griffiths, J. Simmonds, M. Leverington, Y. Wang, L. Fish, L. Sayers, L. Alibert, S. Orford, L. Wingen, J. Snape, Meta-QTL analysis of the genetic control of crop height in elite European winter wheat germplasm. Mol. Breed. 29, 159–171 (2012). doi:10.1007/s11032-010-9534-x

125. T. Wicker, F. Sabot, A. Hua-Van, J. L. Bennetzen, P. Capy, B. Chalhoub, A. Flavell, P. Leroy, M. Morgante, O. Panaud, E. Paux, P. SanMiguel, A. H. Schulman, A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007). Medline doi:10.1038/nrg2165
