www.sciencemag.org/cgi/content/full/345/6194/1249721/suppl/DC1
Supplementary Materials for
Structural and functional partitioning of bread wheat chromosome 3B
Frédéric Choulet,* Adriana Alberti, Sébastien Theil, Natasha Glover, Valérie Barbe, Josquin Daron, Lise Pingault, Pierre Sourdille, Arnaud Couloux, Etienne Paux,
Philippe Leroy, Sophie Mangenot, Nicolas Guilhot, Jacques Le Gouis, Francois Balfourier, Michael Alaux, Véronique Jamilloux, Julie Poulain,
Céline Durand, Arnaud Bellec, Christine Gaspin, Jan Safar, Jaroslav Dolezel, Jane Rogers, Klaas Vandepoele, Jean-Marc Aury, Klaus Mayer, Hélène Berges,
Hadi Quesneville, Patrick Wincker, Catherine Feuillet
*Corresponding author. E-mail: [email protected]
Published 18 July 2014, Science 345, 1249721 (2014)
DOI: 10.1126/science.1249721
This PDF file includes:
Materials and Methods
Figs. S1 to S11
Tables S1 to S14
References
Supplementary text
Table of contents:
1. Sequencing, assembly, scaffolding, curation, and construction of a pseudomolecule
1.1. Sequencing
1.1.1. BAC DNA extractions and pool preparation
1.1.2. Roche/454 paired-end library preparation and sequencing
1.1.3. Illumina sequencing of flow-sorted chromosome 3B
1.2. Read assembly and scaffolding
1.3. Scaffolding curation
1.4. Assignment of scaffolds to physical contigs
1.5. Removal of redundant sequences
1.6. Construction of pseudomolecule from chromosome 3B
1.6.1. SNP marker development via SureSelect Target Enrichment
1.6.2. Ordering and orientating scaffolds along the chromosome
2. Sequence annotation
2.1. Gene annotation using an improved version of the TriAnnot pipeline
2.1.1. Training ab-initio predictors
2.1.2. Gene modeling and definition of gene islands
2.1.3. Classification of gene, pseudogenes, and gene fragments
2.1.4. Content in protein-coding genes and estimation of the prediction accuracy
2.1.5. Using gene models to estimate the quality of wheat whole genome and chromosome-based shotgun assemblies
2.1.6. GO term annotation and enrichment analysis
2.1.7. Identification of genes putatively involved in resistance to pathogens
2.1.8. Non-coding RNA gene predictions
2.2. TE annotation
2.2.1. Classification of a library of Triticeae transposable element sequences for similarity-based annotation
2.2.2. Development of ClariTE for high quality automated annotation of TEs
2.2.3. Solo LTR annotation
2.2.4. De novo repeat identification
2.3. Distribution of TEs
2.3.1. Definition of the centromere
2.3.2. Pattern of TE distribution
2.3.3. Relative proportion of TE in the vicinity of genes
2.4. TE insertion time
3. Expression analyses
3.1. Sample preparation and sequencing
3.2. Read mapping, expression analysis, and detection of alternative splicing
3.3. Segmentation/change-point analysis
4. Chromosome partitioning in maize and barley
4.1. Partitioning of the maize chromosomes
4.2. Partitioning of the barley chromosomes
5. Comparative analyses
5.1. Identification of syntenic and nonsyntenic genes on chromosome 3B
5.2. Collinearity
5.3. Intra- and Inter-chromosomal duplications
5.4. Calculation of synonymous (Ks) and nonsynonymous (Ka) substitution rates
6. Construction of a genetic map and LD mapping
6.1. Plant material and DNA extraction
6.2. Genetic mapping
6.3. Linkage Disequilibrium (LD) mapping
7. metaQTL analysis and projection of the confidence intervals on the 3B pseudomolecule
1. Sequencing, assembly, scaffolding, curation, and construction of a pseudomolecule
1.1. Sequencing
1.1.1. BAC DNA extractions and pool preparation
We used the latest version of the 3B physical map (4) to select a minimal tiling path (MTP) for
sequencing. To avoid redundancy, we discarded all contigs containing fewer than 5 BACs, as
they mostly correspond to contigs with low-quality fingerprints that carry regions already
present in larger contigs. This resulted in an MTP of 8,452 BAC clones assembled into 1,282
BAC contigs. DNA from all 8,452 BAC clones was extracted individually by an adapted
alkaline lysis method that limited contamination by E. coli genomic DNA to around
10%. Extractions were performed on 96-well plates. Inoculation and DNA extraction were
repeated 12 times for each clone using 12 independent plates in order to reach a sufficient
amount of DNA for downstream library preparation. Each BAC clone was incubated in 1.2 ml
2YT (+ 12.5 µg/ml chloramphenicol) in a 96-well plate for 20 hours at 525 rpm and 37°C. After
centrifugation for 10 min at 4°C, the pellets were re-suspended in 200 µl prechilled P1 buffer
(Qiagen, Hilden, Germany) containing RNAse (50 µg/ml). Then, 200 µl P2 buffer (Qiagen)
were added and the plate contents were mixed by five gentle inversions. Plates were left on ice
for 4 minutes and 200 µl pre-chilled P3 buffer (Qiagen) were added. After five further gentle
inversions, the plates were incubated on ice for 10 minutes, followed by 30 min of
centrifugation at 4100 rpm at 4°C. The supernatant was transferred to clean plates, and 350 µl
of isopropanol were added and gently mixed before centrifugation for 30 min at 4000 rpm at
room temperature. Pellets were washed twice using 400 µl 70% ethanol, vacuum dried, and
dissolved in 25 µl TE buffer (4:0.2). One 96-well plate was used for BAC-end sequencing by the
Sanger method in order to verify the BAC positions on the plates and to obtain BAC-end
sequence (BES) information for further sequence assembly. Following DNA extraction,
the contents of the wells corresponding to the same BAC clone were pooled. After
pooling, each extraction was quantified by Fluoroskan microplate fluorometer measurement.
BAC pools were created as follows using information from the MTP: equimolar amounts of
10 or fewer different BACs, belonging or not to the same physical contig, were mixed to create
DNA pools containing at least 15 µg DNA. In total, 922 BAC pools were created while
optimizing the pooling of overlapping BACs.
1.1.2. Roche/454 paired-end library preparation and sequencing
The BAC pools were used to construct 922 mate-pair libraries of 8 kb insert size following a
slightly modified Roche/454 protocol. Briefly, 15 µg of each BAC pool were sheared to about
8 kb, end repaired with the END-it-Repair kit (Epicentre), and ligated to biotinylated loxP
adaptors (Roche). After gel size selection of 8 kb bands and fill in, 300 ng DNA were
circularized by the Cre recombinase and remaining linear DNA was digested by the Plasmid
Safe ATP dependent DNAse (Epicentre) and exonuclease I. Circular DNA was fragmented by
Covaris (Covaris Inc., USA) shearing and biotinylated fragments were immobilized on
streptavidin beads. As the original Roche/454 protocol did not support barcoded mate-pair
libraries, modified barcoded adapters were used for subsequent ligation and then the
library was prepared following the Roche/454 protocol without further modifications. After
library quantification by qPCR, emulsion PCRs were performed on pools of three libraries.
Six libraries were then loaded on one PicoTiterPlate (PTP) and pyrosequenced using the GS FLX Titanium
Instrument (Roche) according to the manufacturer's protocol, in order to obtain at least 20 Mb
of sequence per pool.
1.1.3. Illumina sequencing of flow-sorted chromosome 3B
Approximately 20,000 copies of chromosome 3B were flow-sorted in three batches. Their
DNA was purified and multiple-displacement amplified (MDA) by illustra GenomiPhi V2
DNA Amplification Kit (GE Healthcare, Piscataway, USA) according to Simkova et al. (58).
DNA samples obtained in three independent amplifications were pooled and 1 µg of DNA
was sonicated to a 150- to 600-bp size range using the E210 Covaris instrument (Covaris,
Inc., USA). Fragments were end-repaired and 3’-adenylated before Illumina adapters were
added using the NEBNext Sample Reagent Set (New England Biolabs). Ligation products of
300-600 bp were gel-purified and size-selected. DNA fragments were PCR-amplified using
Illumina adapter-specific primers. After library profile analysis by the Agilent 2100
Bioanalyzer (Agilent Technologies, USA) and qPCR quantification, the library was
sequenced using 100-base read chemistry on a paired-end flow cell on the Illumina
HiSeq2000 (Illumina, USA). This generated 82 Gb of sequence.
1.2. Read assembly and scaffolding
Given the specificity of the sequencing strategy, an automated assembly pipeline was
developed for the project. First, the pooled BAC sequences were evaluated and cleaned for
quality and contamination by E. coli genomic and vector sequences using the SOAP2 aligner
(59). Then, the sequences from each BAC-pool were assembled into scaffolds using the
Newbler assembler from Roche (version MapAsmResearch-04/19/2010-patch-08/17/2010,
http://www.454.com). BESs produced for all the clones were used to check the presence of
the expected BAC sequence in each pool using BLAST (60).
1.3. Scaffolding curation
The first BAC pool sequence assemblies resulted in 16,136 scaffolds (293,806 contigs) with an
N50 of 275 kb and an average of 12.5 scaffolds per physical contig. However, the contig
N50 was only 12 kb, meaning that, although the scaffolds were large, they comprised many
small contigs because of the high proportion of repeated DNA (mainly long terminal repeats of
retrotransposons). To improve the accuracy of the assembly and scaffolding, we developed a
pipeline based, first, on mining information provided by Newbler about the positions of paired
reads in the contigs to validate and improve the scaffolding computed by Newbler. In a
second step, the pipeline integrates data from the Illumina chromosome 3B shotgun and BAC-
end sequencing. Based on the two output files “454PairStatus.txt”, containing all the
information regarding the read position in the assembly, and “454Scaffolds.txt”, presenting
the contig organization on the scaffold, we developed a program pinpointing potential
scaffolding errors. In addition, a module was developed to inform the curator
about previously unplaced contigs that could potentially be introduced into scaffolds (for gap filling)
based on read pair data. Finally, the decision to correct the scaffolding was taken by a curator
after viewing and inspecting the assembly. A corrected version of the file describing
scaffold organization (“454Scaffolds.txt”) was produced by the curator and used to
automatically create a corrected sequence.
The second step aimed at ensuring sequence accuracy, given the high error
rate at homopolymer sites observed with pyrosequencing technology. In addition, manual
curation could not reduce the number of gaps in the assembly. To address these two
issues, we used the 82 Gb of Illumina reads produced from DNA of flow-sorted chromosome
3B. First, we corrected the consensus sequence using an in-house program (61)
that maps reads on the reference 3B sequence using BWA (Burrows-Wheeler Aligner,
http://bio-bwa.sourceforge.net/), detects the variations based on the read quality and the
coverage at each nucleotide position, and finally corrects the reference sequence. Then, using
the reads mapped close to gaps in the reference sequence, we used Gapcloser
(http://soap.genomics.org.cn/) to extend sequence edges, estimate gap sizes more
precisely, or fully fill gaps when possible. In total, 126,290 bases were corrected and 109,914
gaps were filled. These finishing steps resulted in a total of 5,109 scaffolds with an N50 of
463 kb and an average of 4 scaffolds per physical contig (Table S1).
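The N50 values quoted in this section follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions, not part of the original pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Hypothetical toy lengths (kb), not values from the 3B assembly.
toy_scaffolds = [500, 400, 300, 200, 100]
print(n50(toy_scaffolds))
```

The same function applies to contig or scaffold lengths alike, which is why the scaffold N50 and the contig N50 of a single assembly can differ widely.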
1.4. Assignment of scaffolds to physical contigs
The 5,109 scaffolds resulting from the finishing steps were used to build a pseudomolecule of
chromosome 3B. At that stage, 110 small scaffolds were discarded because they
potentially originated from bacterial DNA contamination, leading to a set of 4,999 scaffolds
(Table S1). Although the pooling scheme was optimized to maximize the pooling of
overlapping BACs originating from the same physical contig, 477 of the 922 initial pools
(52%) contained BACs originating from more than one physical contig (up to 5). As all reads
from a given BAC-pool shared the same barcode, additional information was required to
assign the assembled scaffolds to their physical contigs of origin. For that, we searched for
sequence identity between scaffolds and all available sequence tags that were assigned to
individual BACs. Two sources of tags were used: 42,551 non-redundant BESs and 327,282
Whole Genome Profiling (WGP) tags generated by (62) on a subset of the 3B physical map.
This allowed us to assign 96% of the sequence to a contig of the physical map. The remaining
4% correspond to small scaffolds that did not match any BES or tag. Homemade scripts
(available upon request) and a relational MySQL database, connecting the data of the physical
map, tags, pools, and sequences, were developed to retrieve BESs and WGP tags expected to
be present in the assembly of each pool.
1.5. Removal of redundant sequences
Sequencing individual BAC pools raises the problem of sequence redundancy in the assembly,
because the same genomic locus can be assembled twice, independently, when it is
contained in overlapping BACs that were sequenced in different pools. This redundancy
needs to be distinguished from truly duplicated regions because it would otherwise limit our ability to
understand the role of duplications in wheat genome evolution. One solution would
have been to perform a single assembly run with the full set of reads coming from several
BAC pools. However, in that case, the size of the region assembled and the proportion of
repeated k-mers increase, lowering the accuracy of the assembly. Therefore, we chose
to assemble each pool separately and remove the redundancy afterwards. Removing
redundancy could not be done using any of the available read assemblers since scaffolds are
large, repeat-rich, contain gaps of variable size, and, therefore, are not suitable for classical
available tools. Thus, we developed a new program called scaffAssembler that uses BLAT
(63) to assemble scaffolds through iterative pairwise alignments. It performs an all-by-all
comparison of a series of scaffolds and then parses the alignments to capture the presence of
redundant sequences. Three different cases were distinguished: 1) identical scaffolds: when
two scaffolds overlap over their full length; 2) included scaffolds: when a scaffold is fully
identical to a part of a larger one in the dataset; and 3) overlapping scaffolds: when two
scaffolds share identical sequences encompassing their extremities. The criteria used to
consider that 2 scaffolds share a redundant sequence are the following: at least 10 kb
contiguous sequences sharing at least 99% nucleotide identity. When identical scaffolds were
found, one was randomly discarded. For a scaffold fully included into another larger one, the
smaller scaffold was discarded. In the case of overlapping scaffolds, scaffAssembler
computed the assembly by retrieving the coordinates of the matching segments (found by
BLAT with the extendThroughN parameter) and compared the gap content of the 2
redundant segments in order to discard the one with the highest amount of Ns. This procedure
was applied iteratively by pairwise alignments until every case of redundancy was treated
within the considered set of scaffolds.
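The three redundancy cases and the 10 kb / 99% identity criteria described above can be sketched as a small classifier for one pairwise alignment. This is an illustration only and not the scaffAssembler code: the function name, the 0-based half-open coordinates, the assumption that both alignments are reported on the forward strand, and the extra "internal match" label for hits that touch no scaffold extremity are all assumptions made here.

```python
MIN_ALIGNED_BP = 10_000  # "at least 10 kb contiguous sequences"
MIN_IDENTITY = 99.0      # "at least 99% nucleotide identity"


def classify_pair(len_a, len_b, a_start, a_end, b_start, b_end, identity):
    """Classify one alignment between scaffolds A and B.

    Coordinates are 0-based, half-open, and assumed to be already
    normalised to the forward strand of each scaffold.
    """
    aligned = min(a_end - a_start, b_end - b_start)
    if aligned < MIN_ALIGNED_BP or identity < MIN_IDENTITY:
        return "not redundant"
    a_full = a_start == 0 and a_end == len_a
    b_full = b_start == 0 and b_end == len_b
    if a_full and b_full:
        return "identical"       # case 1: overlap over both full lengths
    if a_full or b_full:
        return "included"        # case 2: one scaffold lies inside the other
    a_at_edge = a_start == 0 or a_end == len_a
    b_at_edge = b_start == 0 or b_end == len_b
    if a_at_edge and b_at_edge:
        return "overlapping"     # case 3: shared sequence at extremities
    return "internal match"      # hypothetical label, e.g. a true duplication


print(classify_pair(100_000, 80_000, 85_000, 100_000, 0, 15_000, 99.2))
```

A real implementation would also have to tolerate small unaligned tails and gaps (runs of N) at scaffold ends, which exact coordinate equality as used here does not.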
Two types of sequence redundancies were distinguished and treated separately: “expected”
versus “unexpected” based on the physical map information, i.e. the overlaps between BACs
predicted through comparisons of their fingerprints. First, the expected redundancy
corresponds to every pair of known overlapping BACs that are split into two different
sequencing pools. In total, 521 out of the 922 initial BAC-pools (57%) contain overlapping
MTP BACs. For the expected redundancy, scaffold-based assembly was applied directly with
scaffAssembler on targeted BAC-pools. This led to a decrease in the number of scaffolds
from 4,999 to 4,747 and to the removal of 11.7 Mb (1.2%). However, the major part of redundancy
was in fact unexpected in the physical map. It comes from four different sources: 1) some
BAC-contigs overlapped but were assembled into 2 separate contigs because the local
coverage (i.e. number of BACs covering a locus) was too low for joining with FPC (64) (such
cases were expected especially because the physical map of chromosome 3B was assembled
at high stringency (1e-45) to prevent chimeric BAC-contigs); 2) some BAC-contigs
may be redundant but were mistakenly assembled into separate contigs by FPC; 3) BACs
may be misassembled by FPC (fingerprint assembly errors) i.e. they do not belong to the
predicted BAC-contig; and 4) some wells might be contaminated with other clones. In the
latter two cases, scaffolds originating from misassembled or contaminated BACs are fully
redundant because they carry a genomic locus sequenced in another BAC-pool.
In contrast to the expected redundancy, solving the unexpected redundancy required an all-by-all
comparison of the full set of scaffolds. As ca. 85% of the sequence is made of repeated
elements, all-by-all alignment of 1 Gb of sequence is computationally very intensive.
Therefore, we developed a two-step strategy in which we first used the assignment of BAC-contigs
to 8 deletion bins in order to divide the chromosome into 8 fractions. All-by-all
comparisons limited to scaffolds belonging to the same deletion bin were applied using
iterative runs of scaffAssembler. This led to a decrease in the number of scaffolds to 3,769
and resulted in discarding 59.7 Mb (6%) of redundant sequences. Then, we developed a
strategy based on determining shared TE-junctions between scaffolds to identify potentially
redundant sequences. Indeed, junctions between nested transposons are extremely abundant
and mainly unique in the genome and thus are frequently used as specific molecular markers
(Insertion Site Based Polymorphism, ISBP (50)). They can be used in-silico as a signature to
identify redundancies i.e. scaffolds sharing the same set of ISBPs. In total, 166,385 ISBPs
were predicted along the 3,769 scaffolds with isbpFinder (50) and pairs of scaffolds sharing
identical junctions were detected. Hence, instead of aligning the full dataset, we focused on
38 Mb of TE junctions. With this approach, we identified 2,136 scaffolds clustered into 349
groups of potentially overlapping scaffolds. Their assembly decreased the number of
scaffolds to 2,827 and discarded 88.0 Mb (9.6%) of redundancy. Since a 10 kb length threshold was
applied to consider redundant regions, it was not possible to discard small redundant
scaffolds. However, 19 small scaffolds that carry a redundant copy of a gene predicted on
another scaffold (>99% identity over the full gene length + 500 surrounding bps) were
identified and discarded. This led to a final number of 2,808 scaffolds representing 833 Mb
(Table S1).
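The grouping of scaffolds that share TE junctions can be pictured as finding connected components: two scaffolds fall in the same group if they share a junction signature, directly or through a chain of other scaffolds. The sketch below uses a simple union-find for this. It is an assumption-laden illustration, not the authors' implementation: the function name, the string junction identifiers, and the rule that a single shared junction is enough to link two scaffolds are all hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_junctions(junctions_by_scaffold):
    """Group scaffolds that share at least one junction signature.

    junctions_by_scaffold maps a scaffold name to a set of junction
    identifiers (e.g. sequences spanning a TE insertion site).
    Returns only groups containing two or more scaffolds.
    """
    parent = {name: name for name in junctions_by_scaffold}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # junction -> first scaffold on which it was observed
    for scaffold, junctions in junctions_by_scaffold.items():
        for junction in junctions:
            if junction in first_seen:
                parent[find(scaffold)] = find(first_seen[junction])
            else:
                first_seen[junction] = scaffold

    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical toy input: s1-s2-s3 are chained by shared junctions.
toy = {"s1": {"j1", "j2"}, "s2": {"j2", "j3"}, "s3": {"j3"}, "s4": {"j9"}}
print(cluster_by_shared_junctions(toy))
```

Only the scaffolds inside each group then need to be aligned against one another, which is what makes the approach cheaper than a chromosome-wide all-by-all comparison.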
1.6. Construction of pseudomolecule from chromosome 3B
1.6.1. SNP marker development via SureSelect Target Enrichment
Genomic DNA was extracted from 10 wheat accessions (Chinese Spring, Renan, Preimio,
Robigus, Xi19, Apache, Aztec, Autan, Cezanne and Uli3) following the protocol described by
Graner et al. (65). Sequence capture was performed on these 10 lines using the SureSelect
Target Enrichment System for Illumina Paired-end Sequencing Library version 1.0 May 2010
(Agilent, Santa Clara, CA) following the manufacturer’s procedure. Hybridization was
performed with a SureSelect Target enrichment library containing 120-nu baits corresponding
to 52,265 “low copy DNA – TE” high confidence ISBP (Insertion Site-Based Polymorphism)
markers. Following hybridization, captured DNA was amplified to add either index tag #6 or
#12. Equimolar pools of two samples carrying different tags were constructed and sequenced
on an Illumina HiSeq2000 instrument in 2x100 bp paired-end reads. Illumina reads were
mapped against the ISBP reference sequences using the Mosaik package (The MarthLab;
http://bioinformatics.bc.edu/marthlab/wiki/index.php/Software). For MosaikAligner, the
following parameters were used: -mm 3 -act 60 -minp 0.6 -mmal. SNP calling was done with
GigaBayes (The MarthLab; http://bioinformatics.bc.edu/marthlab/wiki/index.php/Software)
using the following parameters: --ploidy diploid --sample multiple --PSL 1 --O 4 --CRL 20.
Finally, GigaBayes output was processed using a Perl script (snpRanking; available upon
request) to classify the SNPs into four different classes according to the detection of
homozygous/heterozygous individuals in the selected panel. Class 1: only homozygous AA
and BB alleles were detected. Class 2: homozygous AA and BB, and heterozygous AB lines
were detected. Class 3: only homozygous AA and heterozygous AB lines were detected. Class
4: only heterozygous AB lines were detected. A genotype was considered heterozygous when
at least 10% of the reads supported the presence of a heterozygous locus. In total, 49,836 SNPs
were discovered including 33,220, 5,857, 8,858 and 1,901 SNPs from Class 1 to 4,
respectively. Only SNPs from Class 1 and 2 were selected for further analyses, representing
39,077 markers. SNP context sequences were then mapped back to 3B sequence scaffolds.
These data were then used to select a subset of evenly distributed SNPs according to the
following criteria: for scaffolds between 80 and 200 kb, a single SNP was selected; for
scaffolds larger than 200 kb, we selected one SNP every 200 kb so that they were evenly
distributed along the scaffold. Priority was given to Class 1 SNPs that were polymorphic between
Chinese Spring and Renan for mapping purposes. This resulted in a set of 3,735 SNPs (3,390
Class 1 and 345 Class 2) that were submitted to KBioscience (Hoddesdon, UK) for KASPar
assay design. These assays were used to genotype a set of wheat accessions (see section 6).
Eventually, 3,075 SNPs led to useful genotyping results.
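The four SNP classes defined above depend only on which genotype categories are observed across the panel, so they can be sketched in a few lines. The code below is not the snpRanking script. It is a minimal sketch under stated assumptions: the 10% rule is read as a minor-allele read fraction of at least 0.10, a locus with one homozygous class plus heterozygotes is treated as Class 3 whichever allele is homozygous, and monomorphic loci return `None`. The function names are hypothetical.

```python
def call_genotype(a_reads, b_reads, min_minor_fraction=0.10):
    """Call AA, BB or AB for one line from allele read counts.

    Assumption: a line is heterozygous when the less frequent allele is
    supported by at least min_minor_fraction of the reads.
    """
    total = a_reads + b_reads
    if total == 0:
        return None
    if min(a_reads, b_reads) / total >= min_minor_fraction:
        return "AB"
    return "AA" if a_reads > b_reads else "BB"


def snp_class(genotypes):
    """Return the SNP class (1 to 4) for a panel of genotype calls."""
    seen = set(genotypes)
    homozygous = seen & {"AA", "BB"}
    has_het = "AB" in seen
    if len(homozygous) == 2:
        return 2 if has_het else 1   # both homozygous alleles present
    if len(homozygous) == 1:
        return 3 if has_het else None  # None: monomorphic, not a SNP
    return 4 if has_het else None


panel = [call_genotype(40, 0), call_genotype(0, 35), call_genotype(20, 18)]
print(panel, snp_class(panel))
```

The class counts reported in the text are consistent with each other: 33,220 + 5,857 + 8,858 + 1,901 = 49,836, and Classes 1 and 2 together give 39,077 markers.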
1.6.2. Ordering and orientating scaffolds along the chromosome
Scaffolds were ordered and orientated according to their marker content. Molecular markers
were ordered along a genetic map that was refined using LD data (see section 6). Markers
were mapped on scaffold sequences by similarity search of the context sequence using
BLAST (99% identity or more over the full length of the context sequence): 2,594 anchor
SNPs were assigned to 1,006 scaffolds. In addition, 1,180 other markers for which a
sequence was available (primers or context sequence) and originating from the consensus map
(see section 6) were also located on the scaffolds. However, a higher weight was given to the
anchor SNPs compared to markers from the consensus map: when both types were present on
a scaffold, only anchor markers were considered. When a scaffold carried 2 or more markers,
the minimum position (i.e. closest to the telomere of the short arm) was considered as the
position of the scaffold along the chromosome. In addition, when those markers had
different mapping positions, we were able to orientate the scaffold along the telomere-centromere
axis. Finally, relative positions of BES were used to order scaffolds that belong to
the same physical contig (because BACs are ordered within each contig). This information
was then used to infer a position for 270 scaffolds without marker information but for which a
neighbor scaffold was genetically anchored. Altogether, we assigned a position to 1,358
scaffolds representing 774 Mb (93%; N50=949 kb), forming the 3B pseudomolecule. Among
them, 489 scaffolds (52% of the size of the pseudomolecule) were orientated. Finally,
1,450 scaffolds remained unplaced along the pseudomolecule. They account for only 7% of
the sequence and 6% of the predicted genes.
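The placement rule described in this section (take the minimum genetic position of a scaffold's markers, and orientate the scaffold only when its markers map to different positions) can be sketched as follows. The function name, the tuple layout, and the way orientation is derived from the outermost markers on the scaffold are assumptions for illustration; the text does not spell out how conflicting interior markers were handled.

```python
def place_scaffold(marker_hits):
    """Derive a chromosome position and orientation for one scaffold.

    marker_hits is a list of (position_on_scaffold_bp, genetic_position)
    tuples. The scaffold position is the minimum genetic position, i.e.
    the one closest to the telomere of the short arm. Orientation is
    "+" if genetic position increases along the scaffold, "-" if it
    decreases, and None if it cannot be determined.
    """
    if not marker_hits:
        return None
    genetic = [g for _, g in marker_hits]
    position = min(genetic)
    orientation = None
    if min(genetic) != max(genetic):
        first = min(marker_hits)  # marker nearest the scaffold start
        last = max(marker_hits)   # marker nearest the scaffold end
        if first[1] < last[1]:
            orientation = "+"
        elif first[1] > last[1]:
            orientation = "-"
    return position, orientation


# Hypothetical scaffolds with made-up marker positions.
scaffolds = {
    "scafA": [(1_000, 30.0), (50_000, 28.0)],
    "scafB": [(10, 5.0)],
    "scafC": [(1_000, 12.0), (90_000, 12.5)],
}
placed = {name: place_scaffold(hits) for name, hits in scaffolds.items()}
print(sorted(placed, key=lambda name: placed[name][0]))
```

Scaffolds with a single marker, or with markers that all map to the same position, receive a position but no orientation, which matches the observation that only some of the placed scaffolds could be orientated.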
2. Sequence annotation
2.1. Gene annotation using an improved version of the TriAnnot pipeline
2.1.1. Training ab-initio predictors
Ab-initio gene predictors Augustus (66) and GeneID (67) had previously been trained with a
limited sample of genes (2). Here, we took advantage of having access for the first time to
thousands of wheat genes to improve the accuracy of the predictors. Thus, we first performed an
automated gene modeling with a previous untrained release of TriAnnot over the full wheat
3B sequence that led to 9,233 predictions. Of these, 6,475 were manually checked under
Artemis (68) and their structure was corrected as needed. Among them, 3,273 coding
sequences (CDSs; average length: 1,219 nucleotides; average number of exons: 4.5) had a
structure automatically validated, i.e., each structural feature was supported by biological
evidence (see below). Those were used for training ab-initio predictors of TriAnnot (53)
which significantly improved the specificity of Augustus and both the sensitivity and
specificity of GeneID. In addition, a wheat specific matrix was computed for the Eugene
combiner (69) with our gene sample. A second run of gene modeling was then launched with
the newly trained predictors while mapping manually curated genes back to the chromosome.
2.1.2. Gene modeling and definition of gene islands
Several improvements were made to the TriAnnot pipeline (53) to increase the accuracy and
validate gene models. First, by combining evidence from different sources, we improved the
module responsible for selecting the best gene model at a given locus. Selection is performed
through a scoring method that estimates the accuracy of the CDS structure, based on checking
the reliability of the positions of the start codon, the stop codon, and the splice sites.
TriAnnot checks whether those features correspond to biological data according to spliced
alignments of transcripts and proteins over the chromosome using Exonerate (70). Using
transcript sequences (from RNASeq data generated in this study, and Triticeae ESTs/mRNAs
publicly available), the module can validate, or not, the predicted splicing sites in the CDS
model. For the validation of the predicted start and stop codons, the scoring module considers
the similarity with homologous proteins in related species. Taking into account that
extremities are variable in length and sequence between orthologous proteins, we defined a
range of ten amino acids to consider that the predicted start and stop codons correspond to
that of a protein already identified in another species. This automated procedure to support
gene modeling was used to assign a confidence index: “High Confidence” when all features
are supported by biological evidence, or “Low Confidence” when one or more features are not
supported. The score attributed to a prediction is the sum of the percentage of supported
features, the percentage of amino-acid identity and percentage of overlap with the best
BLAST hit in related plant proteomes. The prediction with the highest score at a given locus was
kept in the released annotation. Finally, ab-initio predictions that shared no significant
similarity with proteins annotated in plant genomes or that lacked transcription evidence were
considered false positives and were therefore discarded from the annotation.
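The scoring rule stated above (sum of the percentage of supported features, the percentage of amino-acid identity, and the percentage of overlap with the best BLAST hit) and the confidence index can be written out as a short sketch. This is not TriAnnot code. The function names, the dictionary layout of a candidate model, and the representation of features as a list of booleans (start codon, stop codon, each splice site) are assumptions made for illustration.

```python
def score_model(features_supported, identity_pct, overlap_pct):
    """Score one candidate gene model.

    features_supported: list of booleans, one per structural feature
    (start codon, stop codon, each splice site), True when supported by
    transcript or protein evidence.
    Returns (score, confidence label); the score ranges from 0 to 300.
    """
    supported_pct = 100.0 * sum(features_supported) / len(features_supported)
    score = supported_pct + identity_pct + overlap_pct
    if all(features_supported):
        confidence = "High Confidence"
    else:
        confidence = "Low Confidence"
    return score, confidence


def best_model(candidates):
    """Return the candidate with the highest score at a locus."""
    return max(
        candidates,
        key=lambda c: score_model(c["features"], c["identity"], c["overlap"])[0],
    )


# Hypothetical candidates at one locus.
locus = [
    {"name": "model1", "features": [True, False, True, True],
     "identity": 80.0, "overlap": 95.0},
    {"name": "model2", "features": [True, True, True, True],
     "identity": 90.0, "overlap": 100.0},
]
print(best_model(locus)["name"])
```

Because the three terms are simply added, a model with every feature supported can still lose to one with a better protein match; how ties are broken is not stated in the text and is left to `max` here.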
Considering the 7,264 genes and pseudogenes annotated along the 3B pseudomolecule, the
median intergenic distance of 30 kb was used as a threshold to define genes that are
clustered into islands.
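One way to read the island definition is as single-linkage grouping along the chromosome: consecutive genes stay in the same island as long as the intergenic distance between them does not exceed the 30 kb threshold. The sketch below implements that reading. The function name, the inclusive comparison, and the decision to return isolated genes as single-gene groups are assumptions, since the text gives only the threshold.

```python
ISLAND_GAP_BP = 30_000  # median intergenic distance used as threshold


def gene_islands(genes, max_gap=ISLAND_GAP_BP):
    """Group genes into islands along one chromosome.

    genes: iterable of (start, end) coordinates in bp.
    A gene joins the current island when the distance from the island's
    rightmost gene end to the gene's start is at most max_gap.
    """
    islands = []
    island_end = None
    for start, end in sorted(genes):
        if islands and start - island_end <= max_gap:
            islands[-1].append((start, end))
            island_end = max(island_end, end)
        else:
            islands.append([(start, end)])
            island_end = end
    return islands


# Hypothetical gene coordinates.
toy_genes = [(0, 2_000), (10_000, 12_000), (100_000, 103_000),
             (120_000, 121_000), (400_000, 401_000)]
print([len(island) for island in gene_islands(toy_genes)])
```

Groups of length one correspond to isolated genes; whether those count as islands depends on the minimum island size, which is not fixed in this paragraph.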
2.1.3. Classification of gene, pseudogenes, and gene fragments
Each predicted protein sequence was then analyzed in order to determine if the coding
sequence is likely nonfunctional due to mutation or truncation (pseudogene or gene fragment).
Gene models displaying internal stop codons, frame shift mutations, or deletions (leaving
between 50% and 70% of the length of a complete homolog) within the CDS were considered
as pseudogenes. The genes showing similarity over less than 50% of the length of their best
homolog in plant protein databank were considered as gene fragments.
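The classification rules of this section reduce to a few thresholds on the fraction of the best homolog covered by the prediction, plus flags for disruptive mutations. A minimal sketch follows. The function name and argument names are hypothetical, and two points are assumptions not settled by the text: coverage below 50% takes precedence over the mutation flags, and the 70% boundary is treated as inclusive.

```python
def classify_model(coverage_fraction, has_internal_stop=False,
                   has_frameshift=False):
    """Classify a gene model as gene, pseudogene, or gene fragment.

    coverage_fraction: fraction (0 to 1) of the best homolog's length
    that is covered by the predicted protein.
    """
    if coverage_fraction < 0.50:
        return "gene fragment"
    if has_internal_stop or has_frameshift or coverage_fraction <= 0.70:
        return "pseudogene"
    return "gene"


for example in [(0.40, False, False), (0.60, False, False),
                (0.95, True, False), (0.95, False, False)]:
    print(example, classify_model(*example))
```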
2.1.4. Content in protein-coding genes and estimation of the prediction accuracy
In total, 7,703 protein-coding genes were predicted from the 2,808 scaffolds of the
chromosome 3B sequence including 7,264 on the pseudomolecule. Manual curation was
performed for 48% of them (3711/7703) by checking and correcting the accuracy of gene
modelling with respect to similarity with transcripts and homologous proteins. The automated
procedure for validation of the CDS coordinates revealed that 59% (4571) are of “High
Confidence”, meaning that biological evidence support positions of start codon, stop codon,
and all splicing sites. To identify potential missing gene models in the annotation, we
compared our annotation (using BLASTN) with the gene models predicted with the MIPS
annotation pipeline from assemblies of whole Chromosome Survey Sequences (CSS) (19).
We focused only on high confidence predictions, i.e. HCS1 to 3 models in the CSS
annotation. Genes were considered missing from our annotation if there was no hit with at
least 99% identity covering more than 50% of the BLAST query or hit length. This resulted in
a list of 1,651 potentially missing genes. Out of these, 839 were found on the chromosome 3B
scaffolds (at least 99% identity for at least 90% of the gene length) and, thus, represent
potential errors of the TriAnnot pipeline annotation. To detect those for which biological
evidence is available, we filtered this dataset by searching for similarity with the
Brachypodium genes (at least 35% identity and 70% overlap) and RNASeq-derived
transcripts produced in this study (99% identity and 90% overlap). This left only 226
potentially missing predictions. These were manually inspected to understand why TriAnnot
did not identify them. They could be explained as follows: 1) Part of the gene model was
masked because of similarity with our TE-library; 2) The gene model is a pseudogene; or 3)
The gene model was in fact predicted by TriAnnot, but with a very different structure (the
50% overlap threshold was too stringent). Finally, only 25 gene models predicted by the
MIPS pipeline on the chromosome 3B CSS assemblies and absent from the 3B
pseudomolecule annotation correspond to likely functional genes (showing transcription
evidence and similarity to a Brachypodium gene) that were missed by TriAnnot. They represent
0.3% of the total predicted gene number.
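The first filter of this comparison (no hit with at least 99% identity covering more than 50% of the query or hit length) can be sketched as follows. The function name and the dictionary keys describing a BLAST hit are hypothetical and chosen only for illustration; the last line simply recomputes the reported proportion from the counts given in the text.

```python
def is_missing(hits, query_len, min_identity=99.0, min_cov=0.5):
    """Return True if no BLASTN hit supports the CSS gene model.

    hits: list of dicts with keys 'identity' (percent), 'aln_len' (aligned
    length in bp) and 'hit_len' (length of the matched gene model in bp).
    A hit counts when its identity is >= min_identity and the alignment
    covers more than min_cov of the query or of the hit.
    """
    for h in hits:
        covers = (h["aln_len"] / query_len > min_cov
                  or h["aln_len"] / h["hit_len"] > min_cov)
        if h["identity"] >= min_identity and covers:
            return False
    return True


# Proportion of likely functional genes missed, from the counts in the text.
missed_percent = round(100 * 25 / 7703, 1)
```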
2.1.5. Using gene models to estimate the quality of wheat whole genome and
chromosome-based shotgun assemblies
The 7,264 gene models of the 3B pseudomolecule were mapped on the sequence assemblies
from the same genotype obtained through genome-wide or chromosome-wide sequencing
approaches. When compared to the 949,279 contigs representing a 5x coverage whole
genome shotgun sequence produced by (8), 79% of the 3B genes aligned to 29,498 contigs
(average size = 499 bp). Overall, 27% of these contigs were assigned to the B-genome and
none of them were anchored to a specific chromosome. When compared with the Illumina
shotgun assembly of the 3B chromosome generated by the IWGSC (19), 95% of the 3B genes
were found on 8,059 contigs (average size 6.8 kb) of which 57% were virtually ordered using
synteny.
2.1.6. GO term annotation and enrichment analysis
Genes were classified in the 3 main GO categories: molecular function, biological process and
cellular component. Out of the 7,264 genes predicted on the pseudomolecule, 5,128 (71%)
were associated with at least 1 GO term, representing 1,567 unique GO terms. To determine
gene ontology enrichment, similarity search using BLASTP (E-value < 1e-05) was performed
for each predicted gene product against the PLAZA 2.5 protein database (71). Based on the
functional information of the homologs (GO or InterPro [IPR]), consensus functional
information was then transferred to the 3B protein candidate. Functional terms from the 5 best
homologs with majority coverage (i.e., >50%) were considered for the analysis. Then, GOBU
(Gene Ontology Browsing Utility (72)) was used for enrichment calculations. The full set of
3B gene products annotated on the pseudomolecule was used as the reference comparison set
for the enrichment analysis in the R1, R2, and R3 regions. P-values were calculated under
GOBU with the Multiview Plugin and Fisher’s exact test and they were adjusted with
Benjamini and Hochberg's method using the R function “p.adjust” (correction for
multiple testing). Finally, the redundancy from the list of enriched GO terms was removed
using the program GO Trimming (73) using default parameters.
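The statistical core of this enrichment procedure can be illustrated with a self-contained Python sketch. It is not the GOBU implementation: the function names are hypothetical, and the use of a one-sided test (over-representation only) is an assumption, since the sidedness of the test is not stated here. The second function implements the standard Benjamini-Hochberg step-up adjustment, which is what `p.adjust` computes in R with `method = "BH"`.

```python
from math import comb


def fisher_enrichment(k, n, K, N):
    """One-sided Fisher's exact test, P(X >= k), for term over-representation.

    k: genes carrying the GO term in the region
    n: genes in the region
    K: genes carrying the GO term in the reference set
    N: genes in the reference set
    """
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice one p-value would be computed per GO term and per region (R1, R2, R3), and the adjustment applied to the whole list of tested terms.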
2.1.7. Identification of genes putatively involved in resistance to pathogens
Most resistance genes against fungal pathogens identified in plants are from the NBS-LRR
family (nucleotide-binding leucine-rich repeat). Thus, in order to identify wheat gene products
that are putatively related to resistance against pathogens, we used PFAM (Release 27)
domains PF00931 (NB-ARC) (74). We also searched for similarity against domain PF03018
(Dirigent) that represents a family of proteins that are induced during disease response in
plants. In addition, we added all gene products whose best BLAST hit in the rice proteome
was annotated as a disease resistance protein. In total, 171 putative resistance genes were
identified on chromosome 3B. Regions R1, R2, and R3 carry 43, 36, and 92 of these genes,
respectively. In addition, 68 are syntenic genes while 103 are nonsyntenic.
2.1.8. Non-coding RNA gene predictions
Non-coding RNA genes are usually not or poorly annotated in genome sequencing projects.
Among the few annotated ncRNA genes in GenBank, rRNA and tRNA genes are the most
common ones. In plants, small nuclear RNAs (snRNAs) and small nucleolar RNAs
(snoRNAs) are also well-studied ncRNA families. snoRNAs are a class of ncRNAs that
primarily guide chemical modifications of other RNAs, mainly rRNAs and snRNAs in
eukaryotes. There are two main classes of snoRNAs, the C/D box snoRNAs that are
associated with 2'-O methylation, and the H/ACA box snoRNAs that are associated with
pseudouridylation. snRNAs, also commonly called U-RNAs, are involved in the processing of
pre-mRNAs.
We predicted rRNA genes by using RNAmmer (75) and Rfamscan (v1.0,
ftp://ftp.sanger.ac.uk/pub/databases/Rfam/tools/). RNAmmer was used with both 'euk' and
‘bac’ parameters to find nuclear rRNA genes and those that could originate from the
mitochondrial and/or chloroplastic genomes. In total, 89 5S rRNA genes were retained in the
annotation. They correspond to Rfamscan predictions that share similarity (searched with
BLAST) with a known wheat 5S rRNA sequence (X06094) and half of them are organized in
tandem repeats. Regions of similarity with chloroplast and mitochondrial rRNAs were also
found. Those also showing Rfamscan motifs were retained in the final annotation. In addition,
622 tRNA genes were predicted on 3B by tRNAscan-SE (76).
Rfamscan initially predicted the presence of 42 small nuclear RNA (snRNA) genes. Curation
resulted in keeping 22 of them that contain USE and TATA boxes (77). Remarkably, the
number of U1, U2, and U6 candidates in this chromosome is comparable to the total number
of known predictions in the genome of Arabidopsis thaliana. Around 1,250 small nucleolar
RNA (snoRNA) gene candidates were initially predicted by using Rfamscan. Among them,
1,121 were homologous to snoRNA71 and were not considered in this annotation. Manual
curation of all other candidates led to the validation of 92 potential snoRNA genes, of which 70 are
organized in 16 clusters.
2.2. TE annotation
2.2.1. Classification of a library of Triticeae transposable element sequences for
similarity-based annotation
TE annotation was performed using a library of 4,929 sequences: 1,543 Gypsy, 797 Mariner,
764 Copia, 554 CACTA, 438 unclassified repeats, 234 LINEs, 175 unclassified DNA
transposons with Terminal Inverted Repeats (TIRs), 168 unclassified Long Terminal Repeat
(LTR) retroelements, 127 Mutator, 80 Harbinger, 18 unclassified DNA transposons, 16
Helitron, 11 hAT, and 4 SINEs. This data set originated from two sources: the TREP library
(http://wheat.pw.usda.gov/ITMI/Repeats/), and a previous curated annotation of TEs found in
18 Mb from chromosome 3B (2). Based on the full-length TE sequences present in the library
(i.e. elements having TIRs, LTRs and/or features typical of complete LINEs/SINEs), a first
clustering step was applied for each superfamily. For the class II transposons, Miniature
Inverted-repeat Transposable Elements (MITEs) were considered independently from their
complete parent TE copies in order to create specific clusters for these highly repeated, short,
non-autonomous elements. An all-by-all BLAST alignment was first performed and used to
cluster sequences with MCL (78) (using option -I 1.2). Multiple sequence alignments were
then computed with MAFFT (79) for all clusters comprising 3 or more copies. Jalview was
then used to manually inspect every multiple alignment and its related neighbor-joining
tree. For clusters comprising a large number of sequences and when several monophyletic
groups were clearly separated, each subgroup was defined as a sub-variant of the family. For
LTR-retrotransposons, the borders of the two LTRs were searched using the TRsearch
program included in REPET (80) for each element. These positions were required for ClariTE
(see below) for the automated curation of the predictions.
2.2.2. Development of ClariTE for high quality automated annotation of TEs
The 2,808 scaffolds composing the 3B chromosome were investigated for TE content using
RepeatMasker (cross_match engine with default parameters; http://www.repeatmasker.org/)
with our curated TE library. Since RepeatMasker does not reconstruct the nested TE patterns
and gives overlapping predictions (one locus can share similarity with several TEs in the
library), we developed the ClariTE program (available upon request) to correct the raw
similarity search results and, consequently, provide high quality TE annotation for
downstream structural and evolutionary analyses. ClariTE is based on our TE library
classification and format (see above). ClariTE performs the three following steps:
a. Resolution of overlapping predictions. To solve the overlap between two
predictions, priority was given to keeping the prediction that covers an extremity of a
TE. If none or both of the predictions cover a TE extremity, priority was given to
keeping the longest prediction and recalculating positions of the other one.
b. Merging predictions. This step is essential to resolve the over-fragmentation of the
predictions. Fragmentation is due, firstly, to the presence of gaps in the scaffolds, and,
secondly, to the fact that a newly identified TE copy may diverge from the reference
element so that one element is not predicted as a single piece but is rather split into
several pieces matching different parts of elements from the same family. In that case,
all neighboring pieces related to the same family were merged into a single feature if the
collinearity of the matching segments was respected, except for LTR matching
segments. Indeed, since LTR positions of reference TEs are known and annotated in
our library, this information was considered during the merging process.
c. Reconstruction of nested TEs. We developed a procedure to join separated features
that are part of the same TE and have been scattered by more recent insertions (i.e.
shaping nested clusters). Joining was allowed when 2 segments matching the same
family (respecting the collinearity between the prediction and the reference TE)
are separated by at most 10 predicted TEs. The final stage of the annotation is
the classification of intact full-length TEs versus fragmented TEs. Intact full-length
TEs are predictions covering at least 90% of the reference complete TE in the library
and for which both extremities were identified (in a range of 50 nucleotides).
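Two of the rules described above (priority rules for overlapping predictions in step a, and the final intact versus fragmented classification) can be sketched in Python. This is not the ClariTE code, which is not reproduced here: the function names, the dictionary layout of a prediction, and the interpretation of "in a range of 50 nucleotides" as a tolerance on the reference coordinates are assumptions of this example. The trimming step is also simplified: a prediction that extends on both sides of the kept one only retains its left part.

```python
def resolve_overlap(a, b):
    """Resolve the overlap between two predictions on the same scaffold.

    Each prediction is a dict with 'start' and 'end' (1-based, inclusive) and
    'covers_extremity' (True if the match reaches an end of the reference TE).
    Returns (kept, trimmed). A prediction fully contained in the kept one ends
    up with start > end and should be discarded by the caller.
    """
    def length(p):
        return p["end"] - p["start"] + 1

    if a["covers_extremity"] != b["covers_extremity"]:
        # Priority to the prediction covering a TE extremity.
        kept, other = (a, b) if a["covers_extremity"] else (b, a)
    else:
        # None or both cover an extremity: keep the longest prediction.
        kept, other = (a, b) if length(a) >= length(b) else (b, a)

    trimmed = dict(other)
    if other["start"] < kept["start"]:
        trimmed["end"] = min(other["end"], kept["start"] - 1)
    else:
        trimmed["start"] = max(other["start"], kept["end"] + 1)
    return kept, trimmed


def is_intact(ref_start, ref_end, ref_len, min_cov=0.90, tol=50):
    """Intact full-length TE: at least 90% of the reference element is covered
    and both extremities are found within `tol` nucleotides of its ends."""
    covered = (ref_end - ref_start + 1) / ref_len
    return covered >= min_cov and ref_start <= 1 + tol and ref_end >= ref_len - tol
```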
2.2.3. Solo LTR annotation
Based on the 30,406 intact LTR-RTs annotated on chromosome 3B, we built a library by
extracting 18,928 LTR sequences that started with TG and ended with CA dinucleotides, the
common motifs found at the border of LTRs in wheat. This library was then used for
similarity search with RepeatMasker. In addition, to distinguish solo LTRs from truncated
LTR-RTs, we searched for the presence of a 5-bp Target Site Duplication (TSD; one
nucleotide variation tolerated) flanking the matching region. In total, we detected 3,998 solo
LTRs with TSD.
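The two sequence criteria used here (TG...CA borders and a 5-bp TSD with one tolerated mismatch) can be illustrated with a small Python sketch. The function names and the 0-based, half-open coordinate convention are assumptions of the example, and the toy sequence is hypothetical.

```python
def has_tg_ca_borders(ltr):
    """True if the LTR starts with TG and ends with CA."""
    s = ltr.upper()
    return s.startswith("TG") and s.endswith("CA")


def has_tsd(seq, start, end, size=5, max_mismatch=1):
    """Check for a target site duplication flanking seq[start:end].

    seq: scaffold sequence; start, end: 0-based, half-open coordinates of the
    LTR match. The `size` bases immediately upstream and downstream are
    compared, tolerating `max_mismatch` differences.
    """
    if start < size or end + size > len(seq):
        return False  # not enough flanking sequence
    left = seq[start - size:start].upper()
    right = seq[end:end + size].upper()
    mismatches = sum(1 for x, y in zip(left, right) if x != y)
    return mismatches <= max_mismatch


# Toy example: flank, TSD, LTR-like match, TSD with one mismatch, flank.
toy = "GG" + "ACGTA" + "TGAAATTTCA" + "ACGTT" + "CC"
solo_candidate = has_tg_ca_borders(toy[7:17]) and has_tsd(toy, 7, 17)
```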
2.2.4. De novo repeat identification
De novo annotation was performed with the REPET package V2.0 that combines the
TEdenovo and TEannot pipelines (80). REPET runs on sequence contigs rather than on
scaffolds and therefore, we ran it on the 294,691 contigs of assembly version 2.1 (Table S1)
representing 986.1 Mbp. The known TE library was first used to focus the de novo detection
on unknown repeats: every sequence sharing more than 80% identity with a known consensus
TE sequence was excised from the initial sequence. At that stage, only contigs larger than
5 kb were considered. Repeats corresponding to microsatellites and repeated genes were
filtered out. Finally, a consensus library of 7,009 elements was built and we kept 1,573
consensus sequences of clusters for which at least one full-length copy was found. The library
was then used for similarity search on the 3B scaffolds previously masked with known TEs.
This allowed us to assign 3.6% of the sequence to a family identified de novo.
2.3. Distribution of TEs
2.3.1. Definition of the centromere
A putative location of the centromere was estimated by plotting along the pseudomolecule the
density of Cereba (called CRW) and Quinta LTR-RTs that are known to be associated with
the active centromere (22). The percentage of CRW and Quinta was estimated in sliding
windows of 10 Mb with a step of 1 Mb. The average percentage of CRW and Quinta elements
along the 3B pseudomolecule is 0.4% ranging from 0.0% to 5.5% (per 10 Mb). Two major
peaks corresponding to regions in which the proportion of the two elements is higher than
1.0% were observed (Fig. S2). The first peak covers 7 Mb (265-272 Mb) with an average of
2.6% of the two elements while the second larger peak covers 68 Mb (319-387 Mb) and
contains on average 3.1% of CRW and Quinta. We then examined the conservation of the 179
genes located in the centromeric region of wheat chromosome 3B and the 22 genes located on
rice chromosome 1 between position 16.7-18.5 Mb that correspond to the region with
complete crossover suppression (23). Fourteen matches were identified indicating that a
majority of the rice genes are found in the 3B putative centromeric region. Combined with
data on recombination rate and linkage disequilibrium (Fig. S2), we considered the 122 Mb
region from position 265 Mb to position 387 Mb as the centromeric/pericentromeric region.
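The sliding-window computation used for the CRW/Quinta density (10 Mb windows, 1 Mb step) can be sketched as follows. The function name and the interval representation (0-based, half-open, non-overlapping intervals) are assumptions made for this illustration.

```python
def sliding_window_percent(features, chrom_len, window=10_000_000, step=1_000_000):
    """Percentage of each window covered by the given features.

    features: list of (start, end) intervals, 0-based half-open, assumed not
    to overlap each other (e.g. CRW and Quinta copies).
    Returns a list of (window_start, percent_covered).
    """
    out = []
    for w_start in range(0, max(chrom_len - window, 0) + 1, step):
        w_end = w_start + window
        covered = sum(max(0, min(e, w_end) - max(s, w_start)) for s, e in features)
        out.append((w_start, 100.0 * covered / window))
    return out


# Small-scale toy example: 30 bp sequence, 10 bp windows, 5 bp step.
profile = sliding_window_percent([(0, 5), (12, 14)], chrom_len=30, window=10, step=5)
```

Windows in which the percentage exceeds 1.0% would then be flagged as candidate centromeric peaks.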
2.3.2. Pattern of TE distribution
The TE distribution along chromosome 3B was analyzed in a sliding window of 10 Mb
(step=1 Mb). Segmentation analysis of the global TE content divided the 3B chromosome into
5 regions (Fig. 1C). Boundaries of both telomere regions (0-63 Mb, 700-769 Mb) are very
close to those of regions R1 and R3. In those distal regions, TE proportion is the lowest with
73% and 68%, respectively (Table 2). Segmentation revealed that the central chromosomal
region could be divided into 3 regions according to their TE content: 64-262 Mb, 263-384 Mb
and 385-699 Mb. The centromere corresponds to the region with the highest percentage of
TEs (93%), while both internal parts of the 3BS and 3BL arms show an intermediate and
equivalent level of TEs: 88%.
TE distribution was analyzed in more detail by distinguishing the different superfamilies
(gypsy, CACTA, copia; Fig. S6). Their distribution appeared specific to each superfamily and
we observed three completely different patterns along the chromosome. The decrease of TEs
in the distal regions results essentially from a decrease in gypsy content. In contrast, the
amount of CACTA transposons increases towards the distal regions R1 and R3. Finally, the
distribution of the amount of copia is stable along the chromosome at the 10 Mb scale. The
diversity of the types of TEs was also investigated along the chromosome by plotting the
number of families representing 99% of the TE fraction (N99) within a sliding window of
10 Mb. Results showed that the TE diversity is significantly higher in the distal regions than
in the central region.
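The N99 index can be computed per window as in the following Python sketch. The function name and the input format (cumulated length in bp per family within one window) are assumptions of this example.

```python
def n_fraction(family_bp, fraction=0.99):
    """Smallest number of families, taken from the most to the least abundant,
    whose cumulated length reaches `fraction` of the TE fraction of a window.

    family_bp: dict mapping family name to cumulated length (bp) in the window.
    """
    total = sum(family_bp.values())
    cumulated = 0
    for n, bp in enumerate(sorted(family_bp.values(), reverse=True), start=1):
        cumulated += bp
        if cumulated >= fraction * total:
            return n
    return 0  # empty window


# Hypothetical window dominated by two families.
n99 = n_fraction({"famA": 900, "famB": 95, "famC": 4, "famD": 1})
```

A higher N99 means that more families are needed to account for the TE fraction, i.e. a more diverse TE content.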
2.3.3. Relative proportion of TE in the vicinity of genes
To analyze the TE context in the vicinity of genes, we combined the results of both gene and
TE annotation. Thus, we calculated the average percentages of TEs 20 kb upstream and
downstream on the set of 5,964 filtered CDSs (see section 5.1) present in the pseudomolecule.
Two opposite distribution patterns of TE superfamilies were found. For the three most
abundant superfamilies (gypsy, CACTA, copia), we observed that the abundance increased
with increasing distance from the genes (Fig. S7). For the other superfamilies, and
particularly for DNA transposons and LINEs, the data indicate that they are enriched in the
10 kb compartments flanking the 5’ or 3’ ends of the CDSs, with a peak around 1 to 5 kb near
the CDSs (Fig. S7).
2.4. TE insertion time
The insertion dates of 21,619 intact LTR retrotransposons were estimated by aligning both
LTRs and using a molecular clock as described in (2). Fig. S5A shows the distribution of
insertion times of LTR-RTs. A peak was observed at 1.51 MYA, i.e. before the
allopolyploidization, confirming previous findings (2). Decomposing this analysis per family
(with more than 100 estimated dates) led us to assume that each family had its own pattern of
activity (Fig. S5B). Amplification bursts could be as recent as 0.6 MYA for the Carmilla
family and as old as 3.2 MYA for the Bare-1 family. In addition, a great variability was also
noticed for the duration of the TE activity. Although most of the families seem to have been
active for a relatively short period of time, some appeared to have been active for several
million years. For example, “Daniela” activity showed a peak spanning about 1 MY, whereas
“Nusif” elements were active over a period of more than 3 MY (Fig. S5).
3. Expression analyses
3.1. Sample preparation and sequencing
Thirty RNA samples were used for expression analyses. They correspond to RNAs extracted
in duplicates from five organs (root, leaf, stem, spike, and grain) at three developmental
stages each from hexaploid wheat cv. Chinese Spring (4) (Table S2). RNA quality was
assessed using an RNA nano Chip on the Agilent Bioanalyzer (Agilent, 2100) and the RNA
integrity number (RIN) was calculated for each sample. Only samples with a RIN higher than
7 were used for library construction. Non-oriented RNA-seq libraries were constructed from 4
µg of total RNA using the Illumina TruSeq™ RNA Sample Preparation Kit (Illumina,
#15008136) according to the manufacturer's protocol, with a library insert fragmentation time
of 12 min. Illumina indexes were used to pool two samples per lane. Libraries were sequenced
on an Illumina HiSeq2000 with 2 x 100-bp paired-end reads. Read quality was checked with
the FastQC v0.10.0 software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
3.2. Read mapping, expression analysis, and detection of alternative splicing
RNA-Seq Illumina reads obtained from the 30 samples (see above) were mapped on the
chromosome 3B scaffolds using Tophat2 v2.0.8 (81) and bowtie2 (82) with the default
parameters except: 0 mismatch, 0 splice-mismatch. PCR duplicates were removed with
Samtools (83) with the rmdup option and an annotation-guided read alignment was performed
with Cufflinks v2.1.1 (84) to reconstruct transcripts and estimate transcript abundance in units
of fragments per kb of exon per million mapped reads (FPKM, (85)). Regions with FPKM
values higher than zero were considered as expressed. RNA-Seq data revealed the presence of
expressed regions that were not annotated by TriAnnot and were called "novel transcribed
regions" (NTRs). Alternatively spliced (AS) transcripts were identified using Cufflinks (default
parameters; option -g) and all isoforms were analyzed with MATS (86).
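For reference, the basic definition of the FPKM unit can be written as a one-line Python function. This is only the textbook formula: Cufflinks estimates abundances with a statistical model that also handles multi-mapped fragments and isoforms, which this sketch does not attempt to reproduce. The function name and example numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical region: 100 fragments, 2 kb of exons, 10 million mapped fragments.
value = fpkm(100, 2_000, 10_000_000)
expressed = value > 0  # criterion used in this study: FPKM higher than zero
```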
3.3. Segmentation/change-point analysis
Segmentation analyses were performed using the R package changepoint v1.0.6 (54) with
the Segment Neighbourhood method and BIC penalty on the mean change. Segmentation was
applied to the distribution (sliding window size: 10 Mb, step: 1 Mb) along the chromosome of
the following features: recombination rate, TE content, gene density, and expression breadth.
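The principle of segmenting a windowed profile on its mean can be illustrated with a pure-Python dynamic programme. This sketch is not the R changepoint package: it minimizes the within-segment sum of squared deviations for a number of segments supplied by the user, whereas the analysis described here selected the segmentation with a BIC penalty. The function name is hypothetical.

```python
def segment_means(values, n_segments):
    """Least-squares segmentation of a profile into n_segments segments.

    Returns the sorted list of change points, each being the index of the
    first value of a new segment. Requires len(values) >= n_segments.
    """
    n = len(values)
    # Prefix sums give the cost of any segment in constant time.
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Sum of squared deviations from the mean for values[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[q][j]: minimal cost of splitting values[:j] into q segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for q in range(1, n_segments + 1):
        for j in range(q, n + 1):
            for i in range(q - 1, j):
                c = best[q - 1][i] + cost(i, j)
                if c < best[q][j]:
                    best[q][j], back[q][j] = c, i

    # Trace back the segment starts from the end of the profile.
    change_points, j = [], n
    for q in range(n_segments, 0, -1):
        i = back[q][j]
        if i > 0:
            change_points.append(i)
        j = i
    return sorted(change_points)


# Toy profile with three plateaus.
cps = segment_means([1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9], n_segments=3)
```

Applied to a profile computed in 10 Mb windows with a 1 Mb step, each change point index corresponds to a window start in Mb.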
4. Chromosome partitioning in barley and maize
4.1. Partitioning of the barley chromosomes
We retrieved publicly available genetic recombination and gene expression data as well as
GO-terms, and gene positions along the 7 barley chromosomes from Mayer et al. (34)
(available for download at http://mips.helmholtz-muenchen.de/plant/barley/download/). We
performed the same analysis as for chromosome 3B by computing the pattern of
recombination rate along the chromosomes in a 10 Mb sliding window with a step of 1 Mb. In
each window, the largest distance between 2 markers and their relative position in cM were
considered to calculate a ratio of cM per Mb. Segmentation analysis was then applied (see
paragraph "segmentation/change-point analysis") in order to identify the borders between the
distal high-recombination and the proximal low-recombination regions. A threshold of 0.40
cM/Mb was used to define high versus low-recombination segments (the average for the
whole genome was 0.25 cM/Mb). The coordinates (in Mb) of the high-recombination
segments are the following: chr1: start-37, 388-end; chr2: start-60, 465-end; chr3: start-21,
466-end; chr4: start-42, 462-end; chr5: start-33, 388-end; chr6: start-27, 488-end; chr7: start-
89, 505-end. For each chromosome, a GO-term enrichment analysis was applied as described
in section "GO term annotation and enrichment analysis". We computed the average
expression breadth (out of 8 tested conditions) in a 10 Mb sliding window with a step of 1 Mb.
The average was calculated only from genes expressed in at least 1 condition. As described by the
authors (34), we used an FPKM threshold of 0.4 to consider that a gene is expressed. We then
performed the segmentation analysis as described above. The coordinates of high/low-
recombination regions were considered to calculate the average expression breadth in high
versus low-recombination regions and a Welch t-test was performed to validate the statistical
significance of the differences observed. The pattern of expression breadth along the 7 barley
chromosomes and the defined segments are represented in Fig. S4A. As observed for
chromosome 3B, we found that high-recombination regions in barley carry genes expressed in
fewer conditions (5.9/8) than low-recombination regions (6.7/8). This difference is
statistically significant (p-value<0.01) for 6 out of the 7 chromosomes (the exception is chromosome
4H).
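The expression breadth computation described for barley can be sketched in Python. The function names and the input layout (one list of FPKM values per gene, one value per condition) are assumptions; whether a value exactly equal to the 0.4 threshold counts as expressed is also an assumption of this example.

```python
def expression_breadth(fpkm_by_condition, threshold=0.4):
    """Number of conditions in which a gene is considered expressed."""
    return sum(1 for v in fpkm_by_condition if v >= threshold)


def mean_breadth(genes, threshold=0.4):
    """Average breadth over the genes of a window, ignoring genes that are
    not expressed in any condition. Returns None for an empty set."""
    breadths = [expression_breadth(g, threshold) for g in genes]
    expressed = [b for b in breadths if b >= 1]
    return sum(expressed) / len(expressed) if expressed else None


# Three hypothetical genes measured in three conditions.
window_genes = [[0.5, 0.0, 1.2], [0.1, 0.2, 0.3], [0.4, 0.4, 0.4]]
avg = mean_breadth(window_genes)
```

In this toy window the second gene is never expressed and is excluded, so the average is taken over breadths 2 and 3.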
4.2. Partitioning of the maize chromosomes
We retrieved publicly available data of the IBM (B73×Mo17) population from Ganal et al.
(33) in order to investigate genetic recombination rate along the 10 maize chromosomes. We
computed the pattern of recombination rate along the chromosomes in a 10 Mb sliding
window with a step of 1 Mb following the same approach described above for barley and
applied a segmentation analysis. A threshold of 1.5 cM/Mb was used to define the borders of
the high-recombination regions (the average for the whole genome was 0.8 cM/Mb). The
coordinates (in Mb) of the high-recombination segments are the following: chr1: start-9, 276-
end; chr2: start-17, 178-end; chr3: start-13, 200-end; chr4: start-11; chr5: start-14; chr6: start-
7, 141-end; chr7: start-10, 153-end; chr8: start-20, 152-end; chr9: start-21, 130-end; chr10:
start-5, 125-end. GO-term enrichment was applied as described above (the maize functional
annotation with GO-terms was retrieved at http://ftp.maizesequence.org/release-4a.53/).
Segmentation based on the average expression breadth of genes was performed using publicly
available expression data in 18 conditions from Sekhon et al. (35). The average expression
breadth (considering expressed genes only) was computed in a 10 Mb sliding window with a
step of 1 Mb and we computed the corresponding segmentation. The coordinates of high/low-
recombination regions were considered to calculate the average expression breadth in these
two types of regions. A Welch t-test was performed to validate the statistical significance of
the differences observed. The pattern of average expression breadth along the 10 maize
chromosomes and the defined segments are represented in Fig. S4B. Contrary to what was
observed for chromosome 3B, in maize, high and low-recombination regions carry genes with
similar expression breadth: 13.2 and 12.7 (/18), respectively.
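The Welch t-test used to compare expression breadth between high- and low-recombination regions relies on the following statistic and degrees of freedom (Welch-Satterthwaite approximation). The Python sketch below uses only the standard library and therefore stops short of the p-value, which requires the cumulative distribution of Student's t; the function name and the toy samples are hypothetical.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples
    with possibly unequal variances (each sample needs at least 2 values)."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy samples standing in for per-gene expression breadths in two region types.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```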
5. Comparative analyses
5.1. Identification of syntenic and nonsyntenic genes on chromosome 3B
Three fully sequenced grass species were chosen for the comparative analyses: rice (Os;
source: MSU version 7.0), Brachypodium (Bd; source: Brachypodium Sequencing Initiative,
2.0), and sorghum (Sb; source: phytozome, version 1.4). All TEs and alternative splice
variants were removed from the gene sets of all species. Amino acid sequences were used in
an all-by-all BLAST (cutoff e-value of 1e-5) between the proteomes from each of the species
compared. Syntenic genes in each of the species were defined as genes with the reciprocal
best BLAST hit (RBH) on an orthologous chromosome in at least one other species (Ta3B,
Bd2, Os1 or Sb3). The exact borders of the Brachypodium orthologous region on
chromosome 2 were defined by visualizing the reciprocal best BLAST hits (RBHs) with rice
chromosome 1 under the Artemis Comparison Tool (87). Nonsyntenic genes in each species
were defined as genes for which the best BLAST hit was on a non-orthologous chromosome in
the compared species. An extra round of filtration was applied to the gene sets in order to
remove lineage-specific genes and possible mis-annotations. The filtration consisted of
removing all genes from each species with no homology to at least one other gene in a
compared species (at least 35% amino acid identity, and 35% sequence overlap). A Venn
diagram was constructed by counting the number of genes in each species with their best
BLAST hits on orthologous chromosomes in none, one, two, or three other species. In order
to compare the 3B annotation to the barley 3H annotation (34), we searched for similarity
between the 3B genes and the 2,478 gene models anchored on 3H (Evalue cutoff: 1e-5). We
then counted all the syntenic or nonsyntenic genes with a 3H hit having at least 35% identity,
and 35% query and hit overlap.
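The reciprocal best hit (RBH) criterion used to define syntenic genes can be sketched as follows. The function names, the dictionaries holding the best hits in each direction, and the gene identifiers are hypothetical; only the set of orthologous chromosomes comes from the text.

```python
ORTHOLOGOUS = {"Ta3B", "Bd2", "Os1", "Sb3"}


def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) such that b is the best hit of a and a the best hit of b."""
    return {(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


def syntenic_genes(rbh_pairs, chrom_of):
    """First members of the RBH pairs whose partner lies on an orthologous
    chromosome. chrom_of maps a gene identifier to its chromosome."""
    return {a for a, b in rbh_pairs if chrom_of.get(b) in ORTHOLOGOUS}


# Hypothetical best hits between 3B genes and rice genes.
best_3b_to_os = {"g1": "os1", "g2": "os2", "g3": "os3"}
best_os_to_3b = {"os1": "g1", "os2": "g9", "os3": "g3"}
pairs = reciprocal_best_hits(best_3b_to_os, best_os_to_3b)
syntenic = syntenic_genes(pairs, {"os1": "Os1", "os3": "Os5"})
```

Here g1 and g3 both have a reciprocal best hit, but only g1 is syntenic because its partner lies on rice chromosome 1.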
5.2. Collinearity
Collinearity was detected between the genes located on chromosomes 3B, Os1, Bd2, and Sb3
using the program MCScanX (88). A collinear block was defined as a conserved set of at least
5 genes (anchors) in the same order between 2 genomes, with a maximum of 25 spacer genes
between the anchors in a collinear block. All the amino acid sequences of the filtered gene
sets of each species were used in an all-by-all BLASTp comparison (using E-value cutoff at
1e-10). Collinearity was visualized using the Circos software (89).
5.3. Intra- and Inter-chromosomal duplications
The percentage of intra-chromosomal duplicates (paralogs) on 3B was determined using the
software OrthoMCL (e-value cutoff: 1e-5, percent match cutoff: 35%) (55). The software
produced clustered families of putative orthologs (homologs between species, originating
from the common ancestor) and paralogs (duplicates within a species) on the basis of
sequence similarity. Therefore, we classified all 3B genes that were clustered into the same
family as intra-chromosomal duplicates, i.e. genes with at least one other member of their family
on the same chromosome. 3B genes clustered in a family with wheat gene models annotated
on another chromosome (19), not including genes from group 3, were considered as inter-
chromosomal duplicates. Tandem duplicates were defined as genes in the same family with 5
or fewer spacer genes separating them on the pseudomolecule, and dispersed duplicates were
defined as having more than 5 spacer genes.
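The distinction between tandem and dispersed duplicates reduces to counting the genes that separate two members of a family. In the Python sketch below, the function name and the use of gene ranks (position of each gene in the ordered gene list of the pseudomolecule) are assumptions of the example.

```python
def classify_duplicate_pair(rank_a, rank_b, max_spacers=5):
    """Classify two genes of the same family from their ranks along the
    pseudomolecule (0, 1, 2, ... in gene order)."""
    spacers = abs(rank_a - rank_b) - 1  # genes located between the two copies
    return "tandem" if spacers <= max_spacers else "dispersed"
```

For example, copies at ranks 10 and 16 are separated by 5 genes and are tandem duplicates, whereas copies at ranks 10 and 17 are dispersed duplicates.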
In order to better understand the nonsyntenic gene origin, we searched for the ancestral locus
for each nonsyntenic gene, i.e. the “parental” gene on another chromosome that has been
duplicated and inserted onto 3B. To do this, we used the gene models defined from the CSS
contigs of the 20 other chromosomes (19). We used the program
detect_collinearity_within_gene_families.pl (part of the MCScanX package) and input the
OrthoMCL-determined families consisting of clusters containing 3B nonsyntenic genes and
rice, Brachypodium, sorghum, and wheat-non group 3 homologs. If the best BLAST hits of a
3B nonsyntenic gene were 1) clustered into the same family, 2) collinear among each other,
and 3) collinear with another CSS gene model in the same family, we considered this CSS
gene model as the parental copy. When several parental genes were detected (for example on
the A, B, and D genomes), the one showing the highest score was chosen as the parent. This
resulted in 152 genes for which a parent gene was defined.
To estimate the exact proportion of nonsyntenic 3B genes that originate from inter-
chromosomal duplication while avoiding underestimation due to annotation problems, we
searched for similarity with the full set of CSS assembled contigs of the 18
non-homoeologous wheat chromosomes (not limited to gene models). By using the parameters of
at least 80% nucleotide identity and at least 100 bp aligned as thresholds, we were able to
estimate that 1,793 of 2,065 (87%) nonsyntenic genes share significant similarity with contigs
from a non-homoeologous chromosome and, thus, may originate from inter-chromosomal
duplication. For some of them, no gene model was annotated (partially assembled, split into
several contigs, etc.) and we were not able to detect an annotated parent gene copy using
OrthoMCL clusters only.
5.4. Calculation of synonymous (Ks) and nonsynonymous (Ka) substitution rates
Ka and Ks rates were calculated by first removing pseudogenes and then comparing the
coding sequence of a nonsyntenic gene and its parent copy. Alignments were made with
ClustalW version 2.1 (56). Rates were calculated by the Nei and Gojobori method using
codeml (part of the PAML package; (57)). Age of gene divergence was estimated by the
equation Ks/2r, where r=6.5e-9.
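The age estimate can be reproduced with a one-line Python function. The function name is hypothetical, and the example assumes that r is expressed in substitutions per synonymous site per year, which is the usual convention but is not stated explicitly here; the Ks value is illustrative only.

```python
def divergence_time_years(ks, rate=6.5e-9):
    """Divergence time T = Ks / (2r), with r assumed to be in substitutions
    per synonymous site per year."""
    return ks / (2 * rate)


# Illustrative value: Ks = 0.013 corresponds to about one million years.
age = divergence_time_years(0.013)
```

The factor 2 accounts for substitutions accumulating independently along both lineages since the duplication.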
6. Construction of a genetic map and LD mapping
6.1. Plant material and DNA extraction
Deletion mapping of the SNPs was performed using cytogenetic stocks of cv. Chinese Spring
including a nullisomic 3B-tetrasomic 3A line (90), two ditelosomic 3B lines (91), and 14
deletion lines (3BS-3, 3BS-8, 3BS-7, 3BS-9, 3BS-2, 3BS-4, 3BS-1, 3BS-5, 3BL-2, 3BL-8,
3BL-1, 3BL-9, 3BL-10, 3BL-7) (92). The CsRe single seed descent (SSD) population was
derived from a cross between Chinese Spring (Cs) and Renan (Re) using Cs as female parent.
Twelve F1 plants were selfed to produce F2 seeds among which ~1,500 were sown and selfed
to produce F3 families. Fifteen seeds of each family were then sown and the second plant of
each line was systematically selected and selfed to produce F4 seeds. The same procedure was
applied at each generation until F8 families were obtained. The CsRe SSD population consists
of 1,269 individuals among which a set of 305 was randomly chosen for genetic mapping of
SNPs. DNA was extracted as described in Graner et al. (93) on bulks of leaves collected on 10
F7-seedlings and is thus representative of the F6 generation. Two collections of wheat lines
were used for linkage disequilibrium (LD) mapping: 367 accessions originating from around
the world and representing 98% of the world diversity as estimated using a set of SSR
markers (94). Since this collection exhibits little population structure, LD values are relatively
low (95). In addition, 353 wheat varieties derived from elite European material, in which LD values are
much higher, were selected.
6.2. Genetic mapping
A genetic map of chromosome 3B was constructed using 305 individuals selected from the
CsRe SSD population. Linkage estimation was based on the maximum likelihood method
using CarthaGene (http://www7.inra.fr/mia/T/CarthaGene/) (96) with LOD and θ values of 5
and 0.25 respectively using the Kosambi mapping function to transform recombination
fractions into centimorgans. The chromosome 3B-consensus map was built following the
same strategy described for the IBM map in maize (97) and for the initial physical map of
chromosome 3B (17) using the CsRe-SSD genetic map from chromosome 3B described
above as a framework map on which the position of loci mapped in another population was
extrapolated. The consensus map was constructed using segregation data from the following
40 mapping populations: Chinese Spring x Renan (SSD population) as a framework; the CsRe
F2 population used to anchor the initial version of the physical map of chromosome 3B (17);
18 DH, RIL and F2 Australian wheat populations used for DArT marker genotyping (98); a
composite wheat map integrating 12 maps (http://wheat.pw.usda.gov/cmap/); the new ITMI
population (DH lines) (99); two DH populations, Apache x Balance and Alchemy x Robigus,
developed by Florimond-Desprez and RAGT, respectively; the Chinese Spring x Courtot DH
population (100); RL4452 x AC Domain and SC8021V2 x AC Karma (two populations from
Agriculture Canada (101)); and two populations (Avalon x Cadenza and Savannah x Rialto) used
to map SNP markers developed in the course of the CerealsDB project
(http://www.cerealsdb.uk.net/CerealsDB/SNPS/Documents/DOC_snps.php) (102-104).
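The Kosambi mapping function mentioned above converts a recombination fraction r into a map distance d (in cM) as d = 25 ln((1 + 2r) / (1 - 2r)). The following minimal sketch shows the conversion and its inverse; it is not CarthaGene code, and the function names are illustrative.

```python
import math


def kosambi_cm(r: float) -> float:
    """Map distance in centimorgans for a recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm: float) -> float:
    """Inverse transformation: recombination fraction for a distance in cM."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)


# The linkage threshold of 0.25 used above corresponds to about 27.5 cM.
print(round(kosambi_cm(0.25), 2))
```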
6.3. Linkage Disequilibrium (LD) mapping
LD mapping aims at ordering markers according to their most likely position on the
pseudomolecule based on the fact that correlation coefficient (r²) values are higher when
markers are physically close. The initial order of the markers was based first on deletion-bin
mapping and second on genetic mapping. The LD-mapping strategy was applied when genetic
mapping failed to find recombinant individuals between two or more markers located in the
same genetic bin. For each subset of markers located at the same genetic distance and in the
same deletion bin, r² values were computed using Tassel 4.1.32 software (105). The data file
was filtered for rare alleles (frequency below 5% in the whole population) so that LD
values would not be biased by low allele frequencies. To make the results clearer, r² values
between pairs of markers were calculated in two contrasting populations exhibiting different
levels and extents of LD. Markers were then ordered according to their respective r² values
using in-house software based on the travelling salesman problem. The advantage of such a method is
that it also permits defining the position of new markers that are not polymorphic in the mapping
populations. Finally, the position of each marker in each block was checked and manually
curated when necessary and LD blocks were numbered according to their most likely position
on the genetic map. Markers for which it was not possible to define an optimal order were
attributed the same LD block number.
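The ordering step can be sketched as follows. This is not the software used in the study (r² values were computed with Tassel, and the in-house ordering program is not described beyond its travelling-salesman formulation). The sketch assumes inbred genotypes coded 0/1 with None for missing calls, uses 1 - r² as the distance between markers, and applies a nearest-neighbour path followed by 2-opt improvement; all names and the toy data are hypothetical.

```python
from itertools import combinations


def minor_allele_freq(calls):
    """Minor allele frequency of a 0/1-coded marker, ignoring missing calls."""
    known = [c for c in calls if c is not None]
    p = sum(known) / len(known)
    return min(p, 1.0 - p)


def r_squared(a, b):
    """Squared allelic correlation between two 0/1-coded markers."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    pa = sum(x for x, _ in pairs) / n
    pb = sum(y for _, y in pairs) / n
    pab = sum(x * y for x, y in pairs) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else (pab - pa * pb) ** 2 / denom


def path_cost(order, dist):
    """Sum of distances between consecutive markers along an open path."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def order_markers(genotypes, maf_min=0.05):
    """Order markers so that adjacent markers have high r² (heuristic, not exact)."""
    names = [m for m, g in genotypes.items() if minor_allele_freq(g) >= maf_min]
    if len(names) < 3:
        return names
    dist = {m: {} for m in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = 1.0 - r_squared(genotypes[a], genotypes[b])
    # Nearest-neighbour construction of an initial path.
    order, left = [names[0]], set(names[1:])
    while left:
        nxt = min(left, key=lambda m: (dist[order[-1]][m], m))
        order.append(nxt)
        left.remove(nxt)
    # 2-opt: reverse segments while this shortens the path.
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 1, len(order)):
                cand = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if path_cost(cand, dist) < path_cost(order, dist) - 1e-12:
                    order, improved = cand, True
    return order


# Hypothetical toy data: 40 lines, four linked markers and one rare-allele marker.
genotypes = {
    "M3": [1] * 28 + [0] * 12,
    "M1": [1] * 20 + [0] * 20,
    "Mrare": [1] * 1 + [0] * 39,   # allele frequency 2.5%, filtered out
    "M4": [1] * 32 + [0] * 8,
    "M2": [1] * 24 + [0] * 16,
}
example_order = order_markers(genotypes)
print(example_order)
```

The orientation of the resulting path is arbitrary, which is why an external anchor such as deletion-bin or genetic map position is still needed to number the LD blocks.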
7. metaQTL analysis and projection of the confidence intervals on the 3B
pseudomolecule
A survey of the literature and of our own data identified 121 quantitative trait loci (QTLs) for
50 different traits with r² values ranging from 0.01 to 0.48 and confidence intervals from 5 to
121 cM on chromosome 3B (Table S13). The genetic maps used for QTL detection comprised
between 3 and 51 markers, among which 50% to 100% were found on our 3B
genetic consensus map. MetaQTL analysis was computed with Biomercator (106). It allowed
the projection of 116 of the 121 QTLs on this map. The most likely metaQTL model (BIC
criterion) identified five metaQTLs with confidence intervals ranging from 1.8 to 11.9 cM.
Thirteen metaQTLs reported in the literature with confidence intervals ranging from 0 to 49.6
cM were added, leading to a total of 18 metaQTLs on chromosome 3B (Table S14). By
comparing the sequence of the markers flanking the confidence intervals, each metaQTL
interval was assigned to a sequenced region. The 18 metaQTLs covered between 1.5 and 620
Mb.
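The projection of a confidence interval from the consensus map onto the pseudomolecule can be sketched as a lookup of flanking markers. The sketch below assumes each marker carries both a genetic (cM) and a physical (Mb) position; the marker names, positions and function name are hypothetical and do not come from the study.

```python
from bisect import bisect_left, bisect_right


def project_interval(markers, ci_start_cm, ci_end_cm):
    """Return (start_mb, end_mb) spanned by the markers flanking a cM interval.

    markers: list of (name, position_cM, position_Mb) tuples.
    The left flank is the last marker at or before the interval start, the
    right flank the first marker at or after the interval end (clamped to
    the ends of the map).
    """
    ordered = sorted(markers, key=lambda m: m[1])
    cms = [m[1] for m in ordered]
    left = max(bisect_right(cms, ci_start_cm) - 1, 0)
    right = min(bisect_left(cms, ci_end_cm), len(ordered) - 1)
    # min/max over the whole span tolerates local disagreements in marker order.
    span_mb = [m[2] for m in ordered[left:right + 1]]
    return min(span_mb), max(span_mb)


# Hypothetical markers: (name, cM on the consensus map, Mb on the pseudomolecule)
example_markers = [
    ("m1", 0.0, 1.2),
    ("m2", 3.5, 14.0),
    ("m3", 8.0, 20.5),
    ("m4", 12.0, 28.4),
    ("m5", 20.0, 36.0),
]
print(project_interval(example_markers, 4.0, 11.0))
```

Because recombination is unevenly distributed, a short genetic interval can map to a very large physical region, consistent with the 1.5 to 620 Mb range reported above.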
Supplementary tables
Table S1: Features of the different assemblies from chromosome 3B sequences from the first raw assembly up to the pseudomolecule.
Assembly steps (columns): v2.1 = raw assembly; v3.0 = manual curation of scaffolding; v4.0 = automated gap filling; v4.1.2 = assignation + removal of contamination; v4.2.2 = scaffold assembly based on known overlap; v4.3.2 = scaffold assembly based on unknown overlap (within each deletion bin); v4.4.3 = scaffold assembly based on overlap predicted by shared TE-junctions; pseudomolecule.
Contigs | v2.1 (all contigs) | v2.1 (contigs in scaffolds) | v3.0 | v4.0 | v4.1.2 | v4.2.2 | v4.3.2 | v4.4.3 | pseudomolecule
cumulated size (bp) | 986,092,508 | 855,419,705 | 869,221,962 | 916,084,767 | 915,399,659 | 904,766,638 | 852,223,227 | 773,308,608 | 723,118,112
number | 294,691 | 130,521 | 149,199 | 54,720 | 54,672 | 53,803 | 49,157 | 43,267 | 37,954
N50 (bp) | 8,590 | 11,920 | 11,887 | 41,163 | 41,076 | 41,356 | 42,475 | 44,324 | 45,448
L50 | 25,939 | 18,085 | 18,420 | 6,250 | 6,264 | 6,142 | 5,639 | 4,923 | 4,509
average size (bp) | 3,346 | 6,554 | 5,826 | 16,741 | 16,743 | 16,816 | 17,337 | 17,873 | 19,052
max size (bp) | 163,292 | 163,292 | 163,292 | 512,434 | 512,434 | 512,452 | 512,452 | 512,452 | 512,452
Scaffolds | v2.1 | v3.0 | v4.0 | v4.1.2 | v4.2.2 | v4.3.2 | v4.4.3 | pseudomolecule
cumulated size (bp) | 1,040,382,486 | 995,646,481 | 992,866,407 | 992,338,434 | 980,625,305 | 920,921,491 | 832,800,924 | 774,434,471
number | 16,136 | 5,095 | 5,109 | 4,999 | 4,747 | 3,769 | 2,808 | 1,358
N50 (bp) | 274,769 | 464,172 | 462,955 | 462,955 | 494,575 | 639,215 | 892,435 | 949,321
L50 | 1,048 | 682 | 683 | 683 | 606 | 450 | 296 | 264
average size (bp) | 64,476 | 195,416 | 194,337 | 198,507 | 206,578 | 244,341 | 296,582 | 570,176
max size (bp) | 1,338,329 | 1,648,264 | 1,638,993 | 1,638,993 | 2,795,397 | 3,884,599 | 4,169,843 | 4,169,843
amount of Ns | 184,963,128 | 126,424,892 | 76,845,121 | 77,002,074 | 75,921,345 | 68,757,059 | 59,545,911 | 51,365,845
% of Ns | 17.8% | 12.7% | 7.7% | 7.8% | 7.7% | 7.5% | 7.2% | 6.6%
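For readers unfamiliar with the N50 and L50 statistics in Table S1, a minimal sketch of how they are usually computed from a list of sequence lengths is given below. It follows the common definition (N50 is the length of the sequence at which the running total, from longest to shortest, first reaches half of the assembly size; L50 is the number of sequences needed to get there); the function name and toy lengths are illustrative.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must not be empty")


print(n50_l50([10, 8, 6, 4, 2]))

# The "% of Ns" row is the amount of Ns over the cumulated scaffold size,
# for example for the raw assembly:
pct_ns_raw = 100 * 184_963_128 / 1_040_382_486
print(round(pct_ns_raw, 1))
```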
Table S2: Table of the Zadoks decimal code for wheat growth stages (107). Stages with a
cross were used for the RNA-Seq analysis
Stage | Wheat growth stage | Feekes scale | Zadoks scale | Leaves / Root / Stem / Spike / Grain
Seedling | First leaf through coleoptile | 1 | 10 | X X
Three leaves | 3 leaves unfolded | - | 13 | X
Three tillers | Main shoot and 3 tillers | - | 23 | X
Spike at 1 cm | Pseudostem erection | 5 | 30 | X
Two nodes | 2nd detectable node | 7 | 32 | X X
Meiosis | Flag leaf ligule and collar visible | 9 | 39 | X X
Anthesis | 1/2 of flowering complete | - | 65 | X X
2 DAAs (50°C.days) | Kernel (caryopsis) watery ripe | - | 71 | X X
14 DAAs (350°C.days) | Medium Milk | - | 75 | X
30 DAAs (700°C.days) | Soft dough | - | 85 | X
Table S3: Overrepresented R1 GO terms. The table is divided into three sections: Cellular
component (C), molecular function (F), and biological process (P). The Depth is the depth in
the GO hierarchy. GO term redundancy was removed by GO trimming. Counts of genes and
adjusted p-values are shown for each segment (R1, R2, R3). Significant p-values are in bold.
Term Name | GOID | Term Type | Depth | # genes (R1) | adjusted R1 p-value | # genes (R2) | adjusted R2 p-value | # genes (R3) | adjusted R3 p-value
membrane-bounded vesicle | GO:0031988 | C | 4 | 267 | 1.01E-12 | 500 | 1.00E+00 | 169 | 2.08E-01
vesicle | GO:0031982 | C | 3 | 267 | 1.11E-12 | 501 | 1.00E+00 | 169 | 2.08E-01
cytoplasmic vesicle | GO:0031410 | C | 4 | 267 | 1.11E-12 | 501 | 1.00E+00 | 169 | 2.08E-01
intracellular membrane-bounded organelle | GO:0043231 | C | 4 | 510 | 6.08E-08 | 1369 | 1.00E+00 | 326 | 1.00E+00
organelle | GO:0043226 | C | 2 | 525 | 1.09E-06 | 1456 | 1.00E+00 | 339 | 1.00E+00
intracellular organelle | GO:0043229 | C | 3 | 525 | 1.09E-06 | 1456 | 1.00E+00 | 339 | 1.00E+00
intracellular part | GO:0044424 | C | 3 | 552 | 1.39E-06 | 1563 | 1.00E+00 | 348 | 1.00E+00
ubiquitin ligase complex | GO:0000151 | C | 4 | 8 | 2.20E-02 | 5 | 1.00E+00 | 0 | 1.00E+00
small conjugating protein ligase activity | GO:0019787 | F | 6 | 75 | 2.32E-30 | 23 | 1.00E+00 | 5 | 1.00E+00
protein kinase activity | GO:0004672 | F | 6 | 170 | 3.43E-20 | 223 | 1.00E+00 | 56 | 1.00E+00
transferase activity | GO:0016740 | F | 3 | 240 | 8.97E-11 | 493 | 1.00E+00 | 115 | 1.00E+00
adenyl ribonucleotide binding | GO:0032559 | F | 6 | 227 | 1.47E-07 | 460 | 1.00E+00 | 161 | 4.82E-02
purine nucleoside binding | GO:0001883 | F | 4 | 233 | 5.32E-07 | 485 | 1.00E+00 | 169 | 4.14E-02
enzyme regulator activity | GO:0030234 | F | 2 | 38 | 1.39E-06 | 41 | 1.00E+00 | 5 | 1.00E+00
catalytic activity | GO:0003824 | F | 2 | 516 | 1.43E-05 | 1495 | 1.00E+00 | 300 | 1.00E+00
carboxylesterase activity | GO:0004091 | F | 5 | 23 | 1.28E-03 | 25 | 1.00E+00 | 5 | 1.00E+00
endopeptidase inhibitor activity | GO:0004866 | F | 5 | 16 | 3.63E-03 | 17 | 1.00E+00 | 0 | 1.00E+00
binding | GO:0005488 | F | 2 | 572 | 2.10E-02 | 1737 | 1.00E+00 | 421 | 1.00E+00
post-translational protein modification | GO:0043687 | P | 6 | 243 | 1.55E-38 | 266 | 1.00E+00 | 67 | 1.00E+00
protein modification process | GO:0006464 | P | 5 | 249 | 4.60E-37 | 289 | 1.00E+00 | 71 | 1.00E+00
modification-dependent protein catabolic process | GO:0019941 | P | 7 | 74 | 4.20E-32 | 17 | 1.00E+00 | 6 | 1.00E+00
cellular protein metabolic process | GO:0044267 | P | 5 | 285 | 1.54E-29 | 429 | 1.00E+00 | 97 | 1.00E+00
protein modification by small protein conjugation | GO:0032446 | P | 7 | 71 | 1.81E-29 | 20 | 1.00E+00 | 5 | 1.00E+00
cellular catabolic process | GO:0044248 | P | 4 | 87 | 4.83E-23 | 58 | 1.00E+00 | 10 | 1.00E+00
developmental process | GO:0032502 | P | 2 | 83 | 1.84E-22 | 51 | 1.00E+00 | 12 | 1.00E+00
protein metabolic process | GO:0019538 | P | 4 | 308 | 3.79E-22 | 557 | 1.00E+00 | 120 | 1.00E+00
multicellular organismal process | GO:0032501 | P | 2 | 86 | 9.76E-20 | 62 | 1.00E+00 | 18 | 1.00E+00
cellular macromolecule metabolic process | GO:0044260 | P | 4 | 394 | 7.54E-19 | 866 | 1.00E+00 | 155 | 1.00E+00
phosphorylation | GO:0016310 | P | 6 | 174 | 1.56E-14 | 286 | 1.00E+00 | 58 | 1.00E+00
cellular metabolic process | GO:0044237 | P | 3 | 462 | 5.28E-13 | 1190 | 1.00E+00 | 193 | 1.00E+00
primary metabolic process | GO:0044238 | P | 3 | 493 | 3.45E-10 | 1331 | 1.00E+00 | 238 | 1.00E+00
lipid localization | GO:0010876 | P | 4 | 17 | 9.79E-06 | 9 | 1.00E+00 | 0 | 1.00E+00
DNA metabolic process | GO:0006259 | P | 5 | 43 | 1.25E-02 | 74 | 1.00E+00 | 21 | 1.00E+00
cellular amino acid metabolic process | GO:0006520 | P | 4 | 29 | 1.30E-02 | 47 | 1.00E+00 | 7 | 1.00E+00
cellular amine metabolic process | GO:0044106 | P | 5 | 29 | 1.60E-02 | 48 | 1.00E+00 | 7 | 1.00E+00
developmental maturation | GO:0021700 | P | 3 | 12 | 2.45E-02 | 13 | 1.00E+00 | 0 | 1.00E+00
aromatic amino acid family metabolic process | GO:0009072 | P | 5 | 5 | 2.98E-02 | 1 | 1.00E+00 | 0 | 1.00E+00
L-phenylalanine metabolic process | GO:0006558 | P | 6 | 4 | 2.99E-02 | 0 | 1.00E+00 | 0 | 1.00E+00
aromatic amino acid family catabolic process | GO:0009074 | P | 6 | 4 | 2.99E-02 | 0 | 1.00E+00 | 0 | 1.00E+00
response to abiotic stimulus | GO:0009628 | P | 3 | 19 | 3.34E-02 | 30 | 1.00E+00 | 1 | 1.00E+00
Table S4: Overrepresented R2 GO terms. The table is divided into three sections: Cellular
component (C), molecular function (F), and biological process (P). The Depth is the depth in
the GO hierarchy. GO term redundancy was removed by GO trimming. Counts of genes and
adjusted p-values are shown for each segment (R1,R2,R3). Significant p-values are in bold.
Term Name | GOID | Term Type | Depth | # genes (R1) | adjusted R1 p-value | # genes (R2) | adjusted R2 p-value | # genes (R3) | adjusted R3 p-value
macromolecular complex | GO:0032991 | C | 2 | 38 | 1.00E+00 | 254 | 1.60E-05 | 32 | 1.00E+00
protein complex | GO:0043234 | C | 3 | 25 | 1.00E+00 | 155 | 1.13E-04 | 12 | 1.00E+00
membrane part | GO:0044425 | C | 3 | 29 | 1.00E+00 | 194 | 3.06E-04 | 25 | 1.00E+00
intrinsic to membrane | GO:0031224 | C | 4 | 17 | 1.00E+00 | 134 | 2.36E-03 | 19 | 1.00E+00
proton-transporting ATP synthase complex | GO:0045259 | C | 4 | 1 | 1.00E+00 | 22 | 1.41E-02 | 0 | 1.00E+00
ribonucleoprotein complex | GO:0030529 | C | 3 | 10 | 1.00E+00 | 83 | 4.54E-02 | 13 | 1.00E+00
nucleoside-triphosphatase activity | GO:0017111 | F | 7 | 10 | 1.00E+00 | 144 | 9.80E-08 | 11 | 1.00E+00
hydrolase activity | GO:0016787 | F | 3 | 93 | 1.00E+00 | 586 | 1.23E-07 | 101 | 1.00E+00
hydrolase activity, acting on acid anhydrides | GO:0016817 | F | 4 | 13 | 1.00E+00 | 174 | 7.63E-07 | 21 | 1.00E+00
ATPase activity, coupled | GO:0042623 | F | 9 | 3 | 1.00E+00 | 76 | 3.94E-06 | 4 | 1.00E+00
exonuclease activity | GO:0004527 | F | 6 | 2 | 1.00E+00 | 56 | 1.13E-04 | 3 | 1.00E+00
P-P-bond-hydrolysis-driven transmembrane transporter activity | GO:0015405 | F | 6 | 2 | 1.00E+00 | 50 | 5.38E-04 | 3 | 1.00E+00
nuclease activity | GO:0004518 | F | 5 | 8 | 1.00E+00 | 80 | 6.37E-04 | 6 | 1.00E+00
hydrolase activity, acting on acid anhydrides, catalyzing transmembrane movement of substances | GO:0016820 | F | 4 | 2 | 1.00E+00 | 47 | 1.25E-03 | 3 | 1.00E+00
ATPase activity, coupled to movement of substances | GO:0043492 | F | 10 | 2 | 1.00E+00 | 47 | 1.25E-03 | 3 | 1.00E+00
helicase activity | GO:0004386 | F | 8 | 3 | 1.00E+00 | 38 | 1.47E-03 | 0 | 1.00E+00
nucleic acid binding | GO:0003676 | F | 3 | 85 | 1.00E+00 | 422 | 2.34E-03 | 75 | 1.00E+00
purine NTP-dependent helicase activity | GO:0070035 | F | 9 | 1 | 1.00E+00 | 26 | 3.56E-03 | 0 | 1.00E+00
monovalent inorganic cation transmembrane transporter activity | GO:0015077 | F | 8 | 7 | 1.00E+00 | 44 | 1.38E-02 | 0 | 1.00E+00
transcription regulator activity | GO:0030528 | F | 2 | 28 | 1.00E+00 | 142 | 1.43E-02 | 16 | 1.00E+00
antioxidant activity | GO:0016209 | F | 2 | 2 | 1.00E+00 | 45 | 2.31E-02 | 6 | 1.00E+00
oxidoreductase activity, acting on peroxide as acceptor | GO:0016684 | F | 4 | 2 | 1.00E+00 | 45 | 2.31E-02 | 6 | 1.00E+00
cation transmembrane transporter activity | GO:0008324 | F | 6 | 8 | 1.00E+00 | 62 | 2.55E-02 | 6 | 1.00E+00
cellular biosynthetic process | GO:0044249 | P | 4 | 105 | 1.00E+00 | 540 | 9.80E-08 | 67 | 1.00E+00
cellular macromolecule biosynthetic process | GO:0034645 | P | 5 | 88 | 1.00E+00 | 435 | 3.59E-06 | 52 | 1.00E+00
gene expression | GO:0010467 | P | 4 | 85 | 1.00E+00 | 418 | 1.60E-05 | 53 | 1.00E+00
regulation of cellular process | GO:0050794 | P | 3 | 65 | 1.00E+00 | 344 | 2.20E-05 | 43 | 1.00E+00
regulation of transcription, DNA-dependent | GO:0006355 | P | 6 | 56 | 1.00E+00 | 267 | 2.46E-04 | 27 | 1.00E+00
generation of precursor metabolites and energy | GO:0006091 | P | 4 | 19 | 1.00E+00 | 112 | 1.24E-03 | 7 | 1.00E+00
purine ribonucleotide metabolic process | GO:0009150 | P | 7 | 1 | 1.00E+00 | 43 | 1.25E-03 | 3 | 1.00E+00
energy coupled proton transport, down electrochemical gradient | GO:0015985 | P | 5 | 1 | 1.00E+00 | 29 | 1.47E-03 | 0 | 1.00E+00
ATP biosynthetic process | GO:0006754 | P | 10 | 1 | 1.00E+00 | 29 | 1.47E-03 | 0 | 1.00E+00
purine ribonucleoside triphosphate metabolic process | GO:0009205 | P | 8 | 1 | 1.00E+00 | 40 | 2.57E-03 | 3 | 1.00E+00
cellular localization | GO:0051641 | P | 3 | 5 | 1.00E+00 | 63 | 3.10E-03 | 6 | 1.00E+00
purine-containing compound biosynthetic process | GO:0072522 | P | 6 | 1 | 1.00E+00 | 39 | 3.33E-03 | 3 | 1.00E+00
nucleobase, nucleoside, nucleotide and nucleic acid metabolic process | GO:0006139 | P | 4 | 116 | 1.00E+00 | 464 | 3.57E-03 | 66 | 1.00E+00
cation transport | GO:0006812 | P | 5 | 5 | 1.00E+00 | 55 | 4.24E-03 | 4 | 1.00E+00
oxidative phosphorylation | GO:0006119 | P | 5 | 7 | 1.00E+00 | 54 | 5.48E-03 | 2 | 1.00E+00
signaling | GO:0023052 | P | 2 | 7 | 1.00E+00 | 80 | 6.22E-03 | 11 | 1.00E+00
protein transport | GO:0015031 | P | 4 | 4 | 1.00E+00 | 52 | 8.97E-03 | 5 | 1.00E+00
cellular protein localization | GO:0034613 | P | 5 | 4 | 1.00E+00 | 52 | 8.97E-03 | 5 | 1.00E+00
protein localization | GO:0008104 | P | 4 | 6 | 1.00E+00 | 58 | 9.11E-03 | 5 | 1.00E+00
intracellular transport | GO:0046907 | P | 4 | 5 | 1.00E+00 | 55 | 1.84E-02 | 6 | 1.00E+00
cellular response to stimulus | GO:0051716 | P | 3 | 10 | 1.00E+00 | 85 | 1.95E-02 | 12 | 1.00E+00
intracellular signal transduction | GO:0035556 | P | 4 | 4 | 1.00E+00 | 42 | 2.30E-02 | 3 | 1.00E+00
cellular component organization or biogenesis | GO:0071840 | P | 2 | 14 | 1.00E+00 | 111 | 4.08E-02 | 20 | 1.00E+00
nucleic acid metabolic process | GO:0090304 | P | 5 | 112 | 1.00E+00 | 402 | 4.75E-02 | 54 | 1.00E+00
nucleotide biosynthetic process | GO:0009165 | P | 6 | 1 | 1.00E+00 | 42 | 4.86E-02 | 7 | 1.00E+00
protein complex subunit organization | GO:0071822 | P | 5 | 2 | 1.00E+00 | 26 | 4.90E-02 | 1 | 1.00E+00
Table S5: Overrepresented R3 GO terms. The table is divided into three sections: Cellular
component (C), molecular function (F), and biological process (P). The Depth is the depth in
the GO hierarchy. GO term redundancy was removed by GO trimming. Counts of genes and
adjusted p-values are shown for each segment (R1, R2, R3). Significant p-values are in bold.
Term Name | GOID | Term Type | Depth | # genes (R1) | adjusted R1 p-value | # genes (R2) | adjusted R2 p-value | # genes (R3) | adjusted R3 p-value
purine nucleoside binding | GO:0001883 | F | 4 | 233 | 5.32E-07 | 485 | 1.00E+00 | 169 | 4.14E-02
secondary active transmembrane transporter activity | GO:0015291 | F | 5 | 10 | 1.00E+00 | 10 | 1.00E+00 | 17 | 2.01E-03
adenyl ribonucleotide binding | GO:0032559 | F | 6 | 227 | 1.47E-07 | 460 | 1.00E+00 | 161 | 4.82E-02
response to stimulus | GO:0050896 | P | 2 | 66 | 1.00E+00 | 239 | 1.00E+00 | 125 | 2.13E-12
response to stress | GO:0006950 | P | 3 | 47 | 1.00E+00 | 150 | 1.00E+00 | 101 | 2.10E-14
programmed cell death | GO:0012501 | P | 4 | 34 | 1.00E+00 | 40 | 1.00E+00 | 85 | 1.48E-27
drug transport | GO:0015893 | P | 4 | 9 | 8.90E-01 | 5 | 1.00E+00 | 14 | 2.91E-03
sphingoid metabolic process | GO:0046519 | P | 7 | 0 | 1.00E+00 | 1 | 1.00E+00 | 5 | 4.66E-02
Table S6: Number and proportion of filtered syntenic and nonsyntenic genes in the
pseudomolecule of wheat chromosome 3B, and the orthologous chromosomes in
Brachypodium, rice, and sorghum. Syntenic genes have their best BLAST hit on an
orthologous chromosome in at least one compared species.
Species (chromosome) | Syntenic genes (count) | Syntenic genes (%) | Nonsyntenic genes (count) | Nonsyntenic genes (%)
T. aestivum (3B) | 3899 | 65.4% | 2065 | 34.6%
B. distachyon (2) | 3310 | 95.7% | 149 | 4.3%
O. sativa (1) | 3601 | 94.8% | 196 | 5.2%
S. bicolor (3) | 3449 | 94.3% | 207 | 5.7%
Table S7: Proportion of collinear genes between orthologous chromosomes of four grass
species. The values indicate the percent of genes in species 1 that are collinear with the
genome of species 2 (tae: T. aestivum; bdi: B. distachyon; osa: O. sativa; sbi: S. bicolor).
Species 1 \ Species 2 | tae3B | bdi2 | osa1 | sbi3
tae3B | - | 50.7% | 42.2% | 41.9%
bdi2 | 29.4% | - | 62.5% | 63.0%
osa1 | 26.9% | 68.6% | - | 68.1%
sbi3 | 25.7% | 66.6% | 65.6% | -
Table S8: Proportion of expressed genes on the 3B pseudomolecule and expression breadth.
Percentage of expressed genes are indicated for syntenic genes, nonsyntenic genes, singletons
(i.e., genes with no evidence of intra-chromosomal duplication origin), and genes potentially
originating from intra-chromosomal duplication.
Category | syntenic | nonsyntenic | singletons | Intra-chr. duplicates
% genes expressed in at least 1 condition (/15) | 81.9% | 68.6% | 83.6% | 61.6%
average number of conditions (/15) with expression evidence* | 12.0±4.6 | 9.2±5.5 | 11.8±4.7 | 9.4±5.4
% pseudogenes/gene fragments | 17.2% | 31.6% | 22.0% | 22.6%
*Out of the expressed genes
Table S9: GO term enrichment analysis between syntenic and nonsyntenic genes. The table is
divided into three sections: Cellular component (C), molecular function (F), and biological
process (P). The Depth is the depth in the GO hierarchy. GO term redundancy was removed
by GO trimming. Counts of genes (#) and adjusted p-values are shown for syntenic and
nonsyntenic genes. Significant p-values are in bold.
Term Name | GOID | Depth | # all genes | # nonsyntenic genes | nonsyntenic p-value | # syntenic genes | syntenic p-value
BIOLOGICAL PROCESSES
developmental process | GO:0032502 | 2 | 146 | 16 | 1.0000 | 109 | 0.0000
modification-dependent protein catabolic process | GO:0019941 | 7 | 97 | 13 | 1.0000 | 77 | 0.0000
cellular catabolic process | GO:0044248 | 4 | 155 | 22 | 1.0000 | 114 | 0.0000
multicellular organismal process | GO:0032501 | 2 | 166 | 23 | 1.0000 | 120 | 0.0000
protein modification by small protein conjugation | GO:0032446 | 7 | 96 | 15 | 1.0000 | 75 | 0.0000
regulation of biological process | GO:0050789 | 2 | 510 | 131 | 1.0000 | 307 | 0.0068
nucleic acid metabolic process | GO:0090304 | 5 | 568 | 134 | 1.0000 | 339 | 0.0131
regulation of metabolic process | GO:0019222 | 3 | 422 | 100 | 1.0000 | 255 | 0.0251
cellular macromolecule metabolic process | GO:0044260 | 4 | 1415 | 443 | 1.0000 | 795 | 0.0422
regulation of cellular process | GO:0050794 | 3 | 452 | 114 | 1.0000 | 270 | 0.0463
programmed cell death | GO:0012501 | 4 | 159 | 103 | 0.0000 | 42 | 1.0000
lipid localization | GO:0010876 | 4 | 26 | 21 | 0.0000 | 4 | 1.0000
chromatin assembly | GO:0031497 | 5 | 25 | 20 | 0.0000 | 5 | 1.0000
nucleosome organization | GO:0034728 | 6 | 25 | 20 | 0.0000 | 5 | 1.0000
protein-DNA complex assembly | GO:0065004 | 6 | 25 | 20 | 0.0000 | 5 | 1.0000
protein-DNA complex subunit organization | GO:0071824 | 5 | 25 | 20 | 0.0000 | 5 | 1.0000
response to stress | GO:0006950 | 3 | 298 | 137 | 0.0000 | 113 | 1.0000
chromatin organization | GO:0006325 | 7 | 42 | 27 | 0.0038 | 14 | 1.0000
macromolecular complex assembly | GO:0065003 | 5 | 53 | 32 | 0.0038 | 17 | 1.0000
cellular component assembly at cellular level | GO:0071844 | 5 | 37 | 24 | 0.0038 | 10 | 1.0000
respiratory electron transport chain | GO:0022904 | 5 | 33 | 22 | 0.0038 | 0 | 1.0000
oxidative phosphorylation | GO:0006119 | 5 | 63 | 36 | 0.0038 | 13 | 1.0000
phosphorylation | GO:0016310 | 6 | 518 | 214 | 0.0063 | 261 | 1.0000
cellular component assembly | GO:0022607 | 4 | 55 | 32 | 0.0063 | 18 | 1.0000
macromolecular complex subunit organization | GO:0043933 | 4 | 55 | 32 | 0.0063 | 19 | 1.0000
photosynthesis, light reaction | GO:0019684 | 5 | 15 | 12 | 0.0089 | 2 | 1.0000
cellular component organization at cellular level | GO:0071842 | 4 | 78 | 40 | 0.0306 | 32 | 1.0000
cellular component organization | GO:0016043 | 3 | 105 | 51 | 0.0326 | 45 | 1.0000
viral infectious cycle | GO:0019058 | 4 | 6 | 6 | 0.0353 | 0 | 1.0000
transmembrane transport | GO:0055085 | 3 | 59 | 31 | 0.0570 | 18 | 1.0000
CELLULAR COMPONENT
intracellular membrane-bounded organelle | GO:0043231 | 4 | 2205 | 678 | 1.0000 | 1236 | 0.0000
intracellular non-membrane-bounded organelle | GO:0043232 | 4 | 154 | 83 | 0.0000 | 55 | 1.0000
extracellular region | GO:0005576 | 2 | 42 | 30 | 0.0000 | 12 | 1.0000
ribosome | GO:0005840 | 4 | 34 | 25 | 0.0000 | 5 | 1.0000
protein-DNA complex | GO:0032993 | 3 | 25 | 20 | 0.0000 | 5 | 1.0000
chromatin | GO:0000785 | 5 | 27 | 21 | 0.0000 | 5 | 1.0000
intracellular organelle part | GO:0044446 | 3 | 189 | 91 | 0.0000 | 73 | 1.0000
macromolecular complex | GO:0032991 | 2 | 324 | 141 | 0.0063 | 134 | 1.0000
ribonucleoprotein complex | GO:0030529 | 3 | 106 | 54 | 0.0063 | 37 | 1.0000
external encapsulating structure | GO:0030312 | 3 | 21 | 15 | 0.0145 | 6 | 1.0000
MOLECULAR FUNCTION
small conjugating protein ligase activity | GO:0019787 | 6 | 103 | 17 | 1.0000 | 77 | 0.0000
tetrapyrrole binding | GO:0046906 | 3 | 145 | 82 | 0.0000 | 49 | 1.0000
purine nucleoside binding | GO:0001883 | 4 | 887 | 369 | 0.0000 | 427 | 1.0000
iron ion binding | GO:0005506 | 7 | 157 | 84 | 0.0000 | 56 | 1.0000
adenyl ribonucleotide binding | GO:0032559 | 6 | 848 | 352 | 0.0000 | 411 | 1.0000
structural molecule activity | GO:0005198 | 2 | 98 | 52 | 0.0038 | 33 | 1.0000
oxidoreductase activity | GO:0016491 | 3 | 310 | 135 | 0.0063 | 119 | 1.0000
NADH dehydrogenase (quinone) activity | GO:0050136 | 6 | 39 | 23 | 0.0306 | 6 | 1.0000
Table S10: Inter-chromosomal duplicates in wheat chromosome 3B and related species.
Species (chromosome) | # gene families* | # duplicated gene families | # genes (interdup) | % inter-dup (% of total families) | % inter-dup (% of total genes)
T. aestivum (3B) | 3949 | 1321 | 2032 | 33.4% | 34.1%
B. distachyon (2) | 3008 | 670 | 841 | 22.3% | 24.3%
O. sativa (1) | 3053 | 659 | 859 | 21.6% | 22.6%
S. bicolor (3) | 3087 | 646 | 824 | 20.9% | 22.5%
*OrthoMCL clusters of paralogs or orthologs when only considering the genes from wheat 3B, rice 1, B. distachyon 2, and sorghum 3.
Table S11: Intra-chromosomal duplicates in wheat chromosome 3B and related species.
Species (chromosome) | # gene families* | # duplicated gene families | # duplicated genes | % duplicates (out of total # families) | % duplicates (out of total # genes) | Average (standard deviation) number of duplicates per family | Maximum # duplicates per family
T. aestivum (3B) | 3949 | 809 | 2216 | 20.5% | 37.2% | 2.7±2.0 | 31
B. distachyon (2) | 3008 | 215 | 529 | 7.1% | 15.3% | 2.5±1.1 | 13
O. sativa (1) | 3053 | 242 | 669 | 7.9% | 17.6% | 2.8±1.9 | 17
S. bicolor (3) | 3087 | 241 | 592 | 7.8% | 16.2% | 2.5±1.2 | 15
Table S12: Tandem and dispersed duplicates in wheat chromosome 3B and related species.
Species (chromosome) | # tandem duplicates | % tandem duplicates (out of total # genes) | % tandem duplicates (out of # intrachr. duplicates) | # dispersed duplicates | % dispersed duplicates (out of total # genes) | % dispersed duplicates (out of # intrachr. duplicates)
T. aestivum (3B) | 1022 | 17.1% | 46.1% | 1194 | 20.0% | 53.9%
B. distachyon (2) | 293 | 8.5% | 55.4% | 236 | 6.8% | 44.6%
O. sativa (1) | 441 | 11.6% | 65.9% | 228 | 6.0% | 34.1%
S. bicolor (3) | 324 | 8.9% | 54.7% | 268 | 7.3% | 45.3%
Table S13: Characteristics of the QTL studies used in the metaQTL analysis.
Publication | Population | Population type | Population size | # markers | # QTL
Groos et al. 2003 (108) | Renan / Recital | SSD F7 | 194 | 14 | 17
An et al. 2006 (109) | Hanxuan10 / Lumai14 | DH | 120 | 29 | 2
Habash et al. 2007 (110) | Chinese Spring / SQ1 | DH | 91 | 51 | 5
Laperche et al. 2007 (111) | Arche / Recital | DH | 220 | 7 | 13
Li et al. 2007 (112)* | Chuan 35050 / Shannong 483 | SSD F14 | 131 | 32 | 6
Fontaine et al. 2009 (113) | Arche / Recital | DH | 137 / 166 / 221 | 16 | 8
Zhang et al. 2011 (114) | PH82-2 / Neixiang 188 | SSD F6 | 240 | 7 | 3
Bennett et al. 2012 (115) | Kukri / RAC875 | DH | 260 / 180 | 41 | 10
Bogard et al. 2011 (116) | Toisondor / CF9107 | DH | 140 | 32 | 11
Guo et al. 2012 (117) | Chuan 35050 / Shannong 483 | SSD F16 | 131 | 45 | 14
Bogard et al. 2013 (118) | Toisondor / Quebon; CF9107 / Quebon; Toisondor / CF9107 | DH; DH; DH | 140; 91; 90 | 46 | 15
Liu et al. 2013 (119) | Hanxuan10 x Lumai14 | DH | 150 | 29 | 11
J. Le Gouis et al. unpub. | Apache / Ornicar | DH | 235 | 3 | 5
J. Le Gouis et al. unpub. | Apache / Isengrain; CF9107 / Apache; CF9107 / Isengrain | DH; DH; DH | 83; 161; 83 | 21 | 1
SSD = Single Seed Descent, DH = Doubled-Haploid
*Li et al. (2007) QTLs were first projected on Guo et al. (117) map removing three
inconsistent markers (Xtrap4c, Xwmc291, Xwmc3) from the Li et al. map (112). The
resulting map and QTLs were used in the meta-analysis.
Table S14: Correspondence between 18 metaQTLs, the 3B sequence, the number of annotated genes and the number of markers. The numbers of
ISBP and SSR markers (designed in-silico on the 3B sequence) are indicated only for the 5 metaQTL showing a confidence interval smaller than
10 Mb.
Confidence interval (CI) positions are given on the 3B consensus map (cM) and on the 3B pseudomolecule (Mb).
Publication | metaQTL ID | MetaQTL position (cM) | CI start (cM) | CI end (cM) | CI size (cM) | CI start (Mb) | CI end (Mb) | CI size (Mb) | no. Genes | no. ISBP markers | no. SSR markers
Griffiths et al. 2009 (120) | 3B-1 | 12.97 | 1.60 | 24.34 | 22.74 | 4.1 | 41.3 | 37.2 | 809 | - | -
Griffiths et al. 2009 (120) | 3B-2 | 50.47 | 44.80 | 56.14 | 11.34 | 256.2 | 632.3 | 376.1 | 2,372 | - | -
Mao et al. 2010 (121) | P9 | 10.80 | 7.97 | 13.63 | 5.66 | 21.4 | 28.2 | 6.7 | 128 | 1,505 | 719
Mao et al. 2010 (121) | F5 | 2.71 | 2.54 | 2.88 | 0.34 | 5.2 | 9.4 | 4.2 | 87 | 1,188 | 459
Mao et al. 2010 (121) | F6 | 25.83 | 13.20 | 38.45 | 25.25 | 26.1 | 66.4 | 40.3 | 650 | - | -
Zhang et al. 2010 (122) | MQTL24 | 1.22 | -0.53 | 2.96 | 3.49 | 0.0 | 9.6 | 9.6 | 266 | 2,754 | 1,295
Zhang et al. 2010 (122) | MQTL25 | 30.22 | 17.91 | 42.52 | 24.61 | 34.7 | 135.9 | 101.2 | 1,152 | - | -
Zhang et al. 2010 (122) | MQTL26 | 44.80 | 44.80 | 44.80 | 0.00 | 223.5 | 388.3 | 164.8 | 690 | - | -
Zhang et al. 2010 (122) | MQTL27 | 23.03 | 1.26 | 44.80 | 43.54 | 1.9 | 388.3 | 386.4 | 3,223 | - | -
Zhang et al. 2010 (122) | MQTL28 | 66.85 | 42.07 | 91.63 | 49.56 | 116.8 | 737.0 | 620.3 | 4,713 | - | -
Zhang et al. 2010 (122) | MQTL29 | 105.31 | 106.60 | 104.02 | -2.58 | 753.0 | 758.9 | 5.9 | 115 | 1,319 | 498
Quraishi et al. 2011 (123) | MQTL-3 | 57.10 | 44.80 | 69.40 | 24.60 | 223.5 | 712.3 | 488.7 | 3,524 | - | -
Griffiths et al. 2010 (124) | QTL_height_3B_1 | 75.70 | 46.21 | 77.81 | 31.60 | 429.7 | 725.3 | 295.6 | 2,711 | - | -
This study | MQTL3B-1 | 7.71 | 3.93 | 11.50 | 7.57 | 14.2 | 28.2 | 14.0 | 300 | - | -
This study | MQTL3B-2 | 26.43 | 20.48 | 32.39 | 11.91 | 36.0 | 58.6 | 22.7 | 391 | - | -
This study | MQTL3B-3 | 43.95 | 43.05 | 44.86 | 1.81 | 150.8 | 398.1 | 247.3 | 1,303 | - | -
This study | MQTL3B-4 | 50.84 | 49.18 | 52.50 | 3.32 | 535.9 | 576.3 | 40.4 | 376 | - | -
This study | MQTL3B-5 | 79.16 | 77.66 | 80.66 | 3.00 | 725.1 | 726.6 | 1.5 | 23 | 351 | 160
Supplementary figures
Figure S1: Distribution of the density of the CRWs (Cereba) and Quinta LTR
retrotransposons along the pseudomolecule of chromosome 3B. The percentage of the
elements was calculated in sliding windows (length 10 Mb, step 1 Mb). Cereba (blue), Quinta
subvariant A (green), Quinta subvariant B (red).
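The sliding-window densities used in this and the following figures can be sketched as below. The sketch assumes that "percentage" means the share of bases in each window covered by the annotated elements and that element intervals do not overlap each other; both are assumptions, since the caption does not specify them. Coordinates are 0-based half-open, and the function name is illustrative.

```python
def sliding_window_coverage(intervals, chrom_len, window=10_000_000, step=1_000_000):
    """Percent of each window covered by the given (start, end) intervals.

    Returns a list of (window_start, percent_covered) for every full window.
    """
    results = []
    start = 0
    while start + window <= chrom_len:
        end = start + window
        # Clip each interval to the window and add up the overlapping bases.
        covered = sum(max(0, min(e, end) - max(s, start)) for s, e in intervals)
        results.append((start, 100.0 * covered / window))
        start += step
    return results


# Toy example with small numbers instead of the 10 Mb window and 1 Mb step.
toy = sliding_window_coverage([(0, 5), (12, 14)], chrom_len=20, window=10, step=5)
print(toy)
```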
Figure S2: Distribution of the recombination and the linkage disequilibrium along the
chromosome 3B pseudomolecule. Recombination rate (cM/Mb) was calculated using a sliding
window of 1 Mb and is represented in red; linkage disequilibrium is represented in blue. The
left axis represents the value of recombination while the right axis represents the LD values.
Figure S3: Distribution of the percentage of genes expressed in the different numbers of
experimental conditions (15 conditions tested). The expression of each of the 7,264 predicted
genes was analyzed and each gene was categorized as not expressed (0 conditions) or
expressed in 1 to 15 conditions.
Figure S4: Distribution and segmentation analysis of the expression breadth in the
barley and maize genomes.
A. Distribution of the average expression breadth along the 7 barley chromosomes (8
conditions from Mayer et al. (34)). B. Distribution of the average expression breadth along the
10 maize chromosomes (18 conditions described in Sekhon et al. (35)). The average
expression breadth was calculated in a sliding window of 10 Mb using a 1 Mb step.
Figure S5: Estimation of the insertion time (in million years) for the complete LTR-RTs
annotated on the chromosome 3B sequence. A) Distribution of the insertion dates estimated
for the complete set (21,619 elements). B) Examples of individual profiles of insertion time for
4 families displaying contrasting patterns. The date of the maximal density of amplification is
indicated by the red number and line.
Figure S6: Distribution of the number of TEs from the three major superfamilies along the
3B pseudomolecule. Window size: 10 Mb; Step: 1 Mb. The gray highlighted areas represent
the R1 and R3 regions, whereas the hatched area represents the centromeric-pericentromeric
region. Gypsy (blue), Copia (green), CACTA (red)
Figure S7: TE distribution around coding sequences for the 10 most representative families
of transposable elements. Left graph: Gypsy (blue), copia (green), CACTA (red), unclassified
LTR retrotransposons (brown). Right graph: Unclassified TEs (light green), LINE (light blue),
Harbinger (magenta), unclassified DNA transposons with TIRs (orange), Mutator (purple),
Mariner (cyan). The gene is considered as a single point at position 0 and the average density
of TE was estimated in a window of 20 kb upstream and downstream of each of the gene
models.
Figure S8: Pairwise collinearity analyses between the genes located on orthologous
chromosomes in four grass species: wheat chromosome 3B (Ta3B), rice chromosome 1 (Os1),
Brachypodium chromosome 2 (Bd2), and sorghum chromosome 3 (Sb3). Each line between
chromosomes represents a block of collinearity comprised of at least 5 genes. The number of
collinear blocks followed by the total number of collinear genes is indicated within each
pairwise comparative circle.
7; 2489   12; 2304   10; 2372
117; 1532   128; 1755   124; 1603
Figure S9: Expression breadth of syntenic, nonsyntenic, intra- and inter-chromosomally
duplicated genes. Graphs represent the proportions of syntenic genes (black line), nonsyntenic
genes (red line), intra-chromosomal duplicates (blue line) and single copy genes (green line)
expressed in 1 to 15 different experimental conditions tested.
Figure S10: Distribution among the 6 wheat chromosome groups of putative ancestral
loci for 152 nonsyntenic genes of chromosome 3B for which a parental copy was
determined unambiguously. The expected proportion of genes was found using all the
genes from the annotation of the IWGSC chromosome survey contigs (19) except group 3.
The Chi squared equals 8.606 with 5 degrees of freedom (p= 0.1258), indicating there is
no significant difference from the expected distribution.
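The reported p-value can be reproduced from the chi-squared statistic and its degrees of freedom. The sketch below uses only the standard library and the closed-form survival function of the chi-squared distribution for integer degrees of freedom; the per-group gene counts are not given in this caption, so only the published statistic (8.606, 5 degrees of freedom) is checked. Function names are illustrative.

```python
import math


def chi2_statistic(observed, expected_props):
    """Pearson chi-squared statistic for observed counts vs expected proportions."""
    total = sum(observed)
    return sum((o - total * p) ** 2 / (total * p)
               for o, p in zip(observed, expected_props))


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-squared variable, integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_j x^j / (2j+1)!!
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)


p_value = chi2_sf(8.606, 5)
print(round(p_value, 4))
```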
Figure S11: Relative abundance of TE superfamilies associated with syntenic and
nonsyntenic genes. For each of the major TE superfamilies (according to the 3-letter code
defined in Wicker et al. (125)), the enrichment in TEs found in the 20 kb upstream and
downstream of the nonsyntenic genes was calculated based on the average proportion
observed around syntenic genes. Positive values indicate overrepresentation of TEs around
nonsyntenic compared to syntenic genes, and inversely. Only superfamilies representing more
than 0.1% of the 3B sequence were indicated in this histogram. Enrichment proportions (in %)
are indicated at the top of the histogram.
References
1. J. Dubcovsky, J. Dvorak, Genome plasticity a key factor in the success of polyploid wheat
under domestication. Science 316, 1862–1866 (2007). Medline doi:10.1126/science.1143986
2. F. Choulet, T. Wicker, C. Rustenholz, E. Paux, J. Salse, P. Leroy, S. Schlub, M. C. Le Paslier, G. Magdelenat, C. Gonthier, A. Couloux, H. Budak, J. Breen, M. Pumphrey, S. Liu, X. Kong, J. Jia, M. Gut, D. Brunel, J. A. Anderson, B. S. Gill, R. Appels, B. Keller, C. Feuillet, Megabase level sequencing reveals contrasted organization and evolution patterns of the wheat gene and transposable element spaces. Plant Cell 22, 1686–1701 (2010). Medline doi:10.1105/tpc.110.074187
3. J. Zhang, Evolution by gene duplication: An update. Trends Ecol. Evol. 18, 292–298 (2003). doi:10.1016/S0169-5347(03)00033-8
4. C. Rustenholz, F. Choulet, C. Laugier, J. Safár, H. Simková, J. Dolezel, F. Magni, S. Scalabrin, F. Cattonaro, S. Vautrin, A. Bellec, H. Bergès, C. Feuillet, E. Paux, A 3,000-loci transcription map of chromosome 3B unravels the structural and functional features of gene islands in hexaploid wheat. Plant Physiol. 157, 1596–1608 (2011). Medline doi:10.1104/pp.111.183921
5. I. D. Wilson, G. L. Barker, R. W. Beswick, S. K. Shepherd, C. Lu, J. A. Coghill, D. Edwards, P. Owen, R. Lyons, J. S. Parker, J. R. Lenton, M. J. Holdsworth, P. R. Shewry, K. J. Edwards, A transcriptomics resource for wheat functional genomics. Plant Biotechnol. J. 2, 495–506 (2004). Medline doi:10.1111/j.1467-7652.2004.00096.x
6. P. R. Bhat, A. Lukaszewski, X. Cui, J. Xu, J. T. Svensson, S. Wanamaker, J. G. Waines, T. J. Close, Mapping translocation breakpoints using a wheat microarray. Nucleic Acids Res. 35, 2936–2943 (2007). Medline doi:10.1093/nar/gkm148
7. L. L. Qi, B. Echalier, S. Chao, G. R. Lazo, G. E. Butler, O. D. Anderson, E. D. Akhunov, J. Dvorák, A. M. Linkiewicz, A. Ratnasiri, J. Dubcovsky, C. E. Bermudez-Kandianis, R. A. Greene, R. Kantety, C. M. La Rota, J. D. Munkvold, S. F. Sorrells, M. E. Sorrells, M. Dilbirligi, D. Sidhu, M. Erayman, H. S. Randhawa, D. Sandhu, S. N. Bondareva, K. S. Gill, A. A. Mahmoud, X. F. Ma, J. P. Miftahudin, E. J. Gustafson, V. Conley, J. L. Nduati, J. A. Gonzalez-Hernandez, J. H. Anderson, N. L. Peng, K. G. Lapitan, V. Hossain, S. F. Kalavacharla, M. S. Kianian, D. S. Pathan, H. T. Zhang, D. W. Nguyen, R. D. Choi, T. J. Fenton, P. E. Close, C. O. McGuire, B. S. Qualset, Gill, A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168, 701–712 (2004). Medline doi:10.1534/genetics.104.034868
8. R. Brenchley, M. Spannagl, M. Pfeifer, G. L. Barker, R. D’Amore, A. M. Allen, N. McKenzie, M. Kramer, A. Kerhornou, D. Bolser, S. Kay, D. Waite, M. Trick, I. Bancroft, Y. Gu, N. Huo, M. C. Luo, S. Sehgal, B. Gill, S. Kianian, O. Anderson, P. Kersey, J. Dvorak, W. R. McCombie, A. Hall, K. F. Mayer, K. J. Edwards, M. W. Bevan, N. Hall, Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491, 705–710 (2012). Medline doi:10.1038/nature11650
9. J. Jia, S. Zhao, X. Kong, Y. Li, G. Zhao, W. He, R. Appels, M. Pfeifer, Y. Tao, X. Zhang, R. Jing, C. Zhang, Y. Ma, L. Gao, C. Gao, M. Spannagl, K. F. Mayer, D. Li, S. Pan, F. Zheng, Q. Hu, X. Xia, J. Li, Q. Liang, J. Chen, T. Wicker, C. Gou, H. Kuang, G. He, Y. Luo, B. Keller, Q. Xia, P. Lu, J. Wang, H. Zou, R. Zhang, J. Xu, J. Gao, C. Middleton, Z. Quan, G. Liu, J. Wang, H. Yang, X. Liu, Z. He, L. Mao, J. Wang; International Wheat Genome Sequencing Consortium, Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature 496, 91–95 (2013). Medline doi:10.1038/nature12028
10. H. Q. Ling, S. Zhao, D. Liu, J. Wang, H. Sun, C. Zhang, H. Fan, D. Li, L. Dong, Y. Tao, C. Gao, H. Wu, Y. Li, Y. Cui, X. Guo, S. Zheng, B. Wang, K. Yu, Q. Liang, W. Yang, X. Lou, J. Chen, M. Feng, J. Jian, X. Zhang, G. Luo, Y. Jiang, J. Liu, Z. Wang, Y. Sha, B. Zhang, H. Wu, D. Tang, Q. Shen, P. Xue, S. Zou, X. Wang, X. Liu, F. Wang, Y. Yang, X. An, Z. Dong, K. Zhang, X. Zhang, M. C. Luo, J. Dvorak, Y. Tong, J. Wang, H. Yang, Z. Li, D. Wang, A. Zhang, J. Wang, Draft genome of the wheat A-genome progenitor Triticum urartu. Nature 496, 87–90 (2013). Medline doi:10.1038/nature11997
11. M. C. Schatz, A. L. Delcher, S. L. Salzberg, Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010). Medline doi:10.1101/gr.101360.109
12. V. Marx, Next-generation sequencing: The genome jigsaw. Nature 501, 263–268 (2013). Medline doi:10.1038/501261a
13. P. S. Schnable, D. Ware, R. S. Fulton, J. C. Stein, F. Wei, S. Pasternak, C. Liang, J. Zhang, L. Fulton, T. A. Graves, P. Minx, A. D. Reily, L. Courtney, S. S. Kruchowski, C. Tomlinson, C. Strong, K. Delehaunty, C. Fronick, B. Courtney, S. M. Rock, E. Belter, F. Du, K. Kim, R. M. Abbott, M. Cotton, A. Levy, P. Marchetto, K. Ochoa, S. M. Jackson, B. Gillam, W. Chen, L. Yan, J. Higginbotham, M. Cardenas, J. Waligorski, E. Applebaum, L. Phelps, J. Falcone, K. Kanchi, T. Thane, A. Scimone, N. Thane, J. Henke, T. Wang, J. Ruppert, N. Shah, K. Rotter, J. Hodges, E. Ingenthron, M. Cordes, S. Kohlberg, J. Sgro, B. Delgado, K. Mead, A. Chinwalla, S. Leonard, K. Crouse, K. Collura, D. Kudrna, J. Currie, R. He, A. Angelova, S. Rajasekar, T. Mueller, R. Lomeli, G. Scara, A. Ko, K. Delaney, M. Wissotski, G. Lopez, D. Campos, M. Braidotti, E. Ashley, W. Golser, H. Kim, S. Lee, J. Lin, Z. Dujmic, W. Kim, J. Talag, A. Zuccolo, C. Fan, A. Sebastian, M. Kramer, L. Spiegel, L. Nascimento, T. Zutavern, B. Miller, C. Ambroise, S. Muller, W. Spooner, A. Narechania, L. Ren, S. Wei, S. Kumari, B. Faga, M. J. Levy, L. McMahan, P. Van Buren, M. W. Vaughn, K. Ying, C. T. Yeh, S. J. Emrich, Y. Jia, A. Kalyanaraman, A. P. Hsia, W. B. Barbazuk, R. S. Baucom, T. P. Brutnell, N. C. Carpita, C. Chaparro, J. M. Chia, J. M. Deragon, J. C. Estill, Y. Fu, J. A. Jeddeloh, Y. Han, H. Lee, P. Li, D. R. Lisch, S. Liu, Z. Liu, D. H. Nagel, M. C. McCann, P. SanMiguel, A. M. Myers, D. Nettleton, J. Nguyen, B. W. Penning, L. Ponnala, K. L. Schneider, D. C. Schwartz, A. Sharma, C. Soderlund, N. M. Springer, Q. Sun, H. Wang, M. Waterman, R. Westerman, T. K. Wolfgruber, L. Yang, Y. Yu, L. Zhang, S. Zhou, Q. Zhu, J. L. Bennetzen, R. K. Dawe, J. Jiang, N. Jiang, G. G. Presting, S. R. Wessler, S. Aluru, R. A. Martienssen, S. W. Clifton, W. R. McCombie, R. A. Wing, R. K. Wilson, The B73 maize genome: Complexity, diversity, and dynamics. Science 326, 1112–1115 (2009). Medline doi:10.1126/science.1178534
14. X. Xu, S. Pan, S. Cheng, B. Zhang, D. Mu, P. Ni, G. Zhang, S. Yang, R. Li, J. Wang, G. Orjeda, F. Guzman, M. Torres, R. Lozano, O. Ponce, D. Martinez, G. De la Cruz, S. K. Chakrabarti, V. U. Patil, K. G. Skryabin, B. B. Kuznetsov, N. V. Ravin, T. V. Kolganova, A. V. Beletsky, A. V. Mardanov, A. Di Genova, D. M. Bolser, D. M. Martin, G. Li, Y. Yang, H. Kuang, Q. Hu, X. Xiong, G. J. Bishop, B. Sagredo, N. Mejía, W. Zagorski, R. Gromadka, J. Gawor, P. Szczesny, S. Huang, Z. Zhang, C. Liang, J. He, Y. Li, Y. He, J. Xu, Y. Zhang, B. Xie, Y. Du, D. Qu, M. Bonierbale, M. Ghislain, M. R. Herrera, G. Giuliano, M. Pietrella, G. Perrotta, P. Facella, K. O’Brien, S. E. Feingold, L. E. Barreiro, G. A. Massa, L. Diambra, B. R. Whitty, B. Vaillancourt, H. Lin, A. N. Massa, M. Geoffroy, S. Lundback, D. DellaPenna, C. R. Buell, S. K. Sharma, D. F. Marshall, R. Waugh, G. J. Bryan, M. Destefanis, I. Nagy, D. Milbourne, S. J. Thomson, M. Fiers, J. M. Jacobs, K. L. Nielsen, M. Sønderkær, M. Iovene, G. A. Torres, J. Jiang, R. E. Veilleux, C. W. Bachem, J. de Boer, T. Borm, B. Kloosterman, H. van Eck, E. Datema, B. Hekkert, A. Goverse, R. C. van Ham, R. G. Visser; Potato Genome Sequencing Consortium, Genome sequence and analysis of the tuber crop potato. Nature 475, 189–195 (2011). Medline doi:10.1038/nature10158
15. J. Doležel, M. Kubaláková, E. Paux, J. Bartos, C. Feuillet, Chromosome-based genomics in the cereals. Chromosome Res. 15, 51–66 (2007). Medline doi:10.1007/s10577-006-1106-x
16. J. Safár, J. Bartos, J. Janda, A. Bellec, M. Kubaláková, M. Valárik, S. Pateyron, J. Weiserová, R. Tusková, J. Cíhalíková, J. Vrána, H. Simková, P. Faivre-Rampant, P. Sourdille, M. Caboche, M. Bernard, J. Dolezel, B. Chalhoub, Dissecting large and complex genomes: Flow sorting and BAC cloning of individual chromosomes from bread wheat. Plant J. 39, 960–968 (2004). Medline doi:10.1111/j.1365-313X.2004.02179.x
17. E. Paux, P. Sourdille, J. Salse, C. Saintenac, F. Choulet, P. Leroy, A. Korol, M. Michalak, S. Kianian, W. Spielmeyer, E. Lagudah, D. Somers, A. Kilian, M. Alaux, S. Vautrin, H. Bergès, K. Eversole, R. Appels, J. Safar, H. Simkova, J. Dolezel, M. Bernard, C. Feuillet, A physical map of the 1-gigabase bread wheat chromosome 3B. Science 322, 101–104 (2008). Medline doi:10.1126/science.1161847
18. Supplementary materials are available on Science Online.
19. International Wheat Genome Sequencing Consortium, A chromosome-based draft sequence of the hexaploid (Triticum aestivum) bread wheat genome. Science 345, 1251788 (2014). doi:10.1126/science.1251788
20. B. S. Gill, B. Friebe, T. Endo, Standard karyotype and nomenclature system for description of chromosome bands and structural aberrations in wheat (Triticum aestivum). Genome 34, 830–839 (1991). doi:10.1139/g91-128
21. P. S. Chain, D. V. Grafham, R. S. Fulton, M. G. Fitzgerald, J. Hostetler, D. Muzny, J. Ali, B. Birren, D. C. Bruce, C. Buhay, J. R. Cole, Y. Ding, S. Dugan, D. Field, G. M. Garrity, R. Gibbs, T. Graves, C. S. Han, S. H. Harrison, S. Highlander, P. Hugenholtz, H. M. Khouri, C. D. Kodira, E. Kolker, N. C. Kyrpides, D. Lang, A. Lapidus, S. A. Malfatti, V. Markowitz, T. Metha, K. E. Nelson, J. Parkhill, S. Pitluck, X. Qin, T. D. Read, J. Schmutz, S. Sozhamannan, P. Sterk, R. L. Strausberg, G. Sutton, N. R. Thomson, J. M. Tiedje, G. Weinstock, A. Wollam, J. C. Detter; Genomic Standards Consortium Human Microbiome Project Jumpstart Consortium, Genomics. Genome project standards in a new era of sequencing. Science 326, 236–237 (2009). Medline doi:10.1126/science.1180614
22. B. Li, F. Choulet, Y. Heng, W. Hao, E. Paux, Z. Liu, W. Yue, W. Jin, C. Feuillet, X. Zhang, Wheat centromeric retrotransposons: The new ones take a major role in centromeric structure. Plant J. 73, 952–965 (2013). Medline doi:10.1111/tpj.12086
23. H. Yan, P. B. Talbert, H. R. Lee, J. Jett, S. Henikoff, F. Chen, J. Jiang, Intergenic locations of rice centromeric chromatin. PLOS Biol. 6, e286 (2008). Medline doi:10.1371/journal.pbio.0060286
24. H. Zhang, R. K. Dawe, Total centromere size and genome size are strongly correlated in ten grass species. Chromosome Res. 20, 403–412 (2012). Medline doi:10.1007/s10577-012-9284-1
25. T. Sakuno, K. Tada, Y. Watanabe, Kinetochore geometry defined by cohesion within the centromere. Nature 458, 852–858 (2009). Medline doi:10.1038/nature07876
26. Q. Pan, F. Ali, X. Yang, J. Li, J. Yan, Exploring the genetic characteristics of two recombinant inbred line populations via high-density SNP markers in maize. PLOS ONE 7, e52777 (2012). Medline doi:10.1371/journal.pone.0052777
27. A. J. Lukaszewski, C. A. Curtis, Physical distribution of recombination in B-genome chromosomes of tetraploid wheat. Theor. Appl. Genet. 86, 121–127 (1993). Medline doi:10.1007/BF00223816
28. C. Saintenac, M. Falque, O. C. Martin, E. Paux, C. Feuillet, P. Sourdille, Detailed recombination studies along chromosome 3B provide new insights on crossover distribution in wheat (Triticum aestivum L.). Genetics 181, 393–403 (2009). Medline doi:10.1534/genetics.108.097469
29. J. Evans, R. F. McCormick, D. Morishige, S. N. Olson, B. Weers, J. Hilley, P. Klein, W. Rooney, J. Mullet, Extensive variation in the density and distribution of DNA polymorphism in sorghum genomes. PLOS ONE 8, e79192 (2013). Medline doi:10.1371/journal.pone.0079192
30. International Rice Genome Sequencing Project, The map-based sequence of the rice genome. Nature 436, 793–800 (2005). Medline doi:10.1038/nature03895
31. A. H. Paterson, J. E. Bowers, R. Bruggmann, I. Dubchak, J. Grimwood, H. Gundlach, G. Haberer, U. Hellsten, T. Mitros, A. Poliakov, J. Schmutz, M. Spannagl, H. Tang, X. Wang, T. Wicker, A. K. Bharti, J. Chapman, F. A. Feltus, U. Gowik, I. V. Grigoriev, E. Lyons, C. A. Maher, M. Martis, A. Narechania, R. P. Otillar, B. W. Penning, A. A. Salamov, Y. Wang, L. Zhang, N. C. Carpita, M. Freeling, A. R. Gingle, C. T. Hash, B. Keller, P. Klein, S. Kresovich, M. C. McCann, R. Ming, D. G. Peterson, D. Mehboob-ur-Rahman, P. Ware, K. F. Westhoff, J. Mayer, D. S. Messing, Rokhsar, The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009). Medline doi:10.1038/nature07723
32. A. Gottlieb, H. G. Müller, A. N. Massa, H. Wanjugi, K. R. Deal, F. M. You, X. Xu, Y. Q. Gu, M. C. Luo, O. D. Anderson, A. P. Chan, P. Rabinowicz, K. M. Devos, J. Dvorak, Insular organization of gene space in grass genomes. PLOS ONE 8, e54101 (2013). Medline doi:10.1371/journal.pone.0054101
33. M. W. Ganal, G. Durstewitz, A. Polley, A. Bérard, E. S. Buckler, A. Charcosset, J. D. Clarke, E. M. Graner, M. Hansen, J. Joets, M. C. Le Paslier, M. D. McMullen, P. Montalent, M. Rose, C. C. Schön, Q. Sun, H. Walter, O. C. Martin, M. Falque, A large maize (Zea mays L.) SNP genotyping array: Development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLOS ONE 6, e28334 (2011). Medline doi:10.1371/journal.pone.0028334
34. K. F. Mayer, R. Waugh, J. W. Brown, A. Schulman, P. Langridge, M. Platzer, G. B. Fincher, G. J. Muehlbauer, K. Sato, T. J. Close, R. P. Wise, N. Stein; International Barley Genome Sequencing Consortium, A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711–716 (2012). Medline
35. R. S. Sekhon, R. Briskine, C. N. Hirsch, C. L. Myers, N. M. Springer, C. R. Buell, N. de Leon, S. M. Kaeppler, Maize gene atlas developed by RNA sequencing and comparative evaluation of transcriptomes based on RNA sequencing and microarrays. PLOS ONE 8, e61005 (2013). Medline doi:10.1371/journal.pone.0061005
36. R. S. Baucom, J. C. Estill, C. Chaparro, N. Upshaw, A. Jogi, J. M. Deragon, R. P. Westerman, P. J. Sanmiguel, J. L. Bennetzen, Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLOS Genet. 5, e1000732 (2009). Medline doi:10.1371/journal.pgen.1000732
37. R. S. Baucom, J. C. Estill, J. Leebens-Mack, J. L. Bennetzen, Natural selection on gene function drives the evolution of LTR retrotransposon families in the rice genome. Genome Res. 19, 243–254 (2009). Medline doi:10.1101/gr.083360.108
38. M. Charles, H. Belcram, J. Just, C. Huneau, A. Viollet, A. Couloux, B. Segurens, M. Carter, V. Huteau, O. Coriton, R. Appels, S. Samain, B. Chalhoub, Dynamics and differential proliferation of transposable elements during the evolution of the B and A genomes of wheat. Genetics 180, 1071–1086 (2008). Medline doi:10.1534/genetics.108.092304
39. E. M. Sergeeva, E. A. Salina, I. G. Adonina, B. Chalhoub, Evolutionary analysis of the CACTA DNA-transposon Caspar across wheat species using sequence comparison and in situ hybridization. Mol. Genet. Genomics 284, 11–23 (2010). Medline doi:10.1007/s00438-010-0544-5
40. C. Lu, J. Chen, Y. Zhang, Q. Hu, W. Su, H. Kuang, Miniature inverted-repeat transposable elements (MITEs) have been accumulated through amplification bursts and play important roles in gene expression and species diversity in Oryza sativa. Mol. Biol. Evol. 29, 1005–1017 (2012). Medline doi:10.1093/molbev/msr282
41. M. D. Gale, K. M. Devos, Comparative genetics in the grasses. Proc. Natl. Acad. Sci. U.S.A. 95, 1971–1974 (1998). Medline doi:10.1073/pnas.95.5.1971
42. K. M. Devos, M. D. Gale, Genome relationships: The grass model in current research. Plant Cell 12, 637–646 (2000). Medline doi:10.1105/tpc.12.5.637
43. F. Murat, J. H. Xu, E. Tannier, M. Abrouk, N. Guilhot, C. Pont, J. Messing, J. Salse, Ancestral grass karyotype reconstruction unravels new mechanisms of genome shuffling as a source of plant evolution. Genome Res. 20, 1545–1557 (2010). Medline doi:10.1101/gr.109744.110
44. International Brachypodium Initiative, Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463, 763–768 (2010). Medline doi:10.1038/nature08747
45. E. D. Akhunov, A. R. Akhunova, A. M. Linkiewicz, J. Dubcovsky, D. Hummel, G. R. Lazo, S. Chao, O. D. Anderson, J. David, L. Qi, B. Echalier, B. S. Gill, J. P. Miftahudin, M. Gustafson, M. E. La Rota, D. Sorrells, H. T. Zhang, V. Nguyen, K. Kalavacharla, S. F. Hossain, J. Kianian, N. L. Peng, E. J. Lapitan, V. Wennerlind, J. A. Nduati, D. Anderson, K. S. Sidhu, P. E. Gill, C. O. McGuire, J. Qualset, J. Dvorak, Synteny perturbations between wheat homoeologous chromosomes caused by locus duplications and deletions correlate with recombination rates. Proc. Natl. Acad. Sci. U.S.A. 100, 10836–10841 (2003). Medline doi:10.1073/pnas.1934431100
46. T. Wicker, K. F. Mayer, H. Gundlach, M. Martis, B. Steuernagel, U. Scholz, H. Simková, M. Kubaláková, F. Choulet, S. Taudien, M. Platzer, C. Feuillet, T. Fahima, H. Budak, J. Dolezel, B. Keller, N. Stein, Frequent gene movement and pseudogene evolution is common to the large and complex genomes of wheat, barley, and their relatives. Plant Cell 23, 1706–1718 (2011). Medline doi:10.1105/tpc.111.086629
47. T. Wicker, J. P. Buchmann, B. Keller, Patching gaps in plant genomes results in gene movement and erosion of colinearity. Genome Res. 20, 1229–1237 (2010). Medline doi:10.1101/gr.107284.110
48. M. Morgante, S. Brunner, G. Pea, K. Fengler, A. Zuccolo, A. Rafalski, Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat. Genet. 37, 997–1002 (2005). Medline doi:10.1038/ng1615
49. C. Feuillet, J. E. Leach, J. Rogers, P. S. Schnable, K. Eversole, Crop genome sequencing: Lessons and rationales. Trends Plant Sci. 16, 77–88 (2011). Medline doi:10.1016/j.tplants.2010.10.005
50. E. Paux, S. Faure, F. Choulet, D. Roger, V. Gauthier, J. P. Martinant, P. Sourdille, F. Balfourier, M. C. Le Paslier, A. Chauveau, M. Cakir, B. Gandon, C. Feuillet, Insertion site-based polymorphism markers open new perspectives for genome saturation and marker-assisted selection in wheat. Plant Biotechnol. J. 8, 196–210 (2010). Medline doi:10.1111/j.1467-7652.2009.00477.x
51. B. Goffinet, S. Gerber, Quantitative trait loci: A meta-analysis. Genetics 155, 463–473 (2000). Medline
52. J. A. Foley, N. Ramankutty, K. A. Brauman, E. S. Cassidy, J. S. Gerber, M. Johnston, N. D. Mueller, C. O’Connell, D. K. Ray, P. C. West, C. Balzer, E. M. Bennett, S. R. Carpenter, J. Hill, C. Monfreda, S. Polasky, J. Rockström, J. Sheehan, S. Siebert, D. Tilman, D. P. Zaks, Solutions for a cultivated planet. Nature 478, 337–342 (2011). Medline doi:10.1038/nature10452
53. P. Leroy, N. Guilhot, H. Sakai, A. Bernard, F. Choulet, S. Theil, S. Reboux, N. Amano, T. Flutre, C. Pelegrin, H. Ohyanagi, M. Seidel, F. Giacomoni, M. Reichstadt, M. Alaux, E. Gicquello, F. Legeai, L. Cerutti, H. Numa, T. Tanaka, K. Mayer, T. Itoh, H. Quesneville, C. Feuillet, TriAnnot: A versatile and high performance pipeline for the automated annotation of plant genomes. Front. Plant Sci. 3, 5 (2012). Medline doi:10.3389/fpls.2012.00005
54. J. Chen, A. K. Gupta, Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance (Birkhäuser, Basel, 2012).
55. L. Li, C. J. Stoeckert Jr., D. S. Roos, OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003). Medline doi:10.1101/gr.1224503
56. J. D. Thompson, D. G. Higgins, T. J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994). Medline doi:10.1093/nar/22.22.4673
57. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 13, 555–556 (1997). Medline
58. H. Šimková, J. T. Svensson, P. Condamine, E. Hribová, P. Suchánková, P. R. Bhat, J. Bartos, J. Safár, T. J. Close, J. Dolezel, Coupling amplified DNA from flow-sorted chromosomes to high-density SNP mapping in barley. BMC Genomics 9, 294 (2008). Medline doi:10.1186/1471-2164-9-294
59. R. Li, C. Yu, Y. Li, T. W. Lam, S. M. Yiu, K. Kristiansen, J. Wang, SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009). Medline doi:10.1093/bioinformatics/btp336
60. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997). Medline doi:10.1093/nar/25.17.3389
61. J. M. Aury, C. Cruaud, V. Barbe, O. Rogier, S. Mangenot, G. Samson, J. Poulain, V. Anthouard, C. Scarpelli, F. Artiguenave, P. Wincker, High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies. BMC Genomics 9, 603 (2008). Medline doi:10.1186/1471-2164-9-603
62. R. Philippe, F. Choulet, E. Paux, J. van Oeveren, J. Tang, A. H. Wittenberg, A. Janssen, M. J. van Eijk, K. Stormo, A. Alberti, P. Wincker, E. Akhunov, E. van der Vossen, C. Feuillet, Whole Genome Profiling provides a robust framework for physical mapping and sequencing in the highly complex and repetitive wheat genome. BMC Genomics 13, 47 (2012). Medline doi:10.1186/1471-2164-13-47
63. W. J. Kent, BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002). Medline doi:10.1101/gr.229202. Article published online before March 2002
64. C. Soderlund, S. Humphray, A. Dunham, L. French, Contigs built with fingerprints, markers, and FPC V4.7. Genome Res. 10, 1772–1787 (2000). Medline doi:10.1101/gr.GR-1375R
65. A. Graner, H. Siedler, A. Jahoor, R. G. Herrmann, G. Wenzel, Assessment of the degree and the type of restriction fragment length polymorphism in barley (Hordeum vulgare). Theor. Appl. Genet. 80, 826–832 (1990). Medline doi:10.1007/BF00224200
66. M. Stanke, O. Keller, I. Gunduz, A. Hayes, S. Waack, B. Morgenstern, AUGUSTUS: Ab initio prediction of alternative transcripts. Nucleic Acids Res. 34 (Web Server), W435–W439 (2006). Medline doi:10.1093/nar/gkl200
67. E. Blanco, J. F. Abril, Computational gene annotation in new genome assemblies using GeneID. Methods Mol. Biol. 537, 243–261 (2009). Medline doi:10.1007/978-1-59745-251-9_12
68. K. Rutherford, J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Rajandream, B. Barrell, Artemis: Sequence visualization and annotation. Bioinformatics 16, 944–945 (2000). Medline doi:10.1093/bioinformatics/16.10.944
69. T. Schiex, A. Moisan, P. Rouze, in Computational Biology, O. Gascuel, M.-F. Sagot, Eds., LNCS 2066 (Springer-Verlag, Berlin Heidelberg, 2001), pp. 111–125.
70. G. S. Slater, E. Birney, Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). Medline doi:10.1186/1471-2105-6-31
71. M. Van Bel, S. Proost, E. Wischnitzki, S. Movahedi, C. Scheerlinck, Y. Van de Peer, K. Vandepoele, Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158, 590–600 (2012). Medline doi:10.1104/pp.111.189514
72. W. Lin, Y. Chen, J. Ho, C. Hsiao, GOBU: Toward an integration interface for biological object. J. Information Sci. Eng. 22, 19–30 (2006); http://www.iis.sinica.edu.tw/papers/hoho/5342-F.pdf
73. S. G. Jantzen, B. J. Sutherland, D. R. Minkley, B. F. Koop, GO trimming: Systematically reducing redundancy in large Gene Ontology datasets. BMC Res. Notes 4, 267 (2011). Medline doi:10.1186/1756-0500-4-267
74. G. van Ooijen, G. Mayr, M. M. Kasiem, M. Albrecht, B. J. Cornelissen, F. L. Takken, Structure-function analysis of the NB-ARC domain of plant disease resistance proteins. J. Exp. Bot. 59, 1383–1397 (2008). Medline doi:10.1093/jxb/ern045
75. K. Lagesen, P. Hallin, E. A. Rødland, H. H. Staerfeldt, T. Rognes, D. W. Ussery, RNAmmer: Consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007). Medline doi:10.1093/nar/gkm160
76. T. M. Lowe, S. R. Eddy, tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997). Medline doi:10.1093/nar/25.5.0955
77. S. Connelly, C. Marshallsay, D. Leader, J. W. Brown, W. Filipowicz, Small nuclear RNA genes transcribed by either RNA polymerase II or RNA polymerase III in monocot plants share three promoter elements and use a strategy to regulate gene expression different from that used by their dicot plant counterparts. Mol. Cell. Biol. 14, 5910–5919 (1994). Medline doi:10.1128/MCB.14.9.5910
78. A. J. Enright, S. Van Dongen, C. A. Ouzounis, An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002). Medline doi:10.1093/nar/30.7.1575
79. K. Katoh, K. Kuma, H. Toh, T. Miyata, MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511–518 (2005). Medline doi:10.1093/nar/gki198
80. T. Flutre, E. Duprat, C. Feuillet, H. Quesneville, Considering transposable element diversification in de novo annotation approaches. PLOS ONE 6, e16526 (2011). Medline doi:10.1371/journal.pone.0016526
81. D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, S. L. Salzberg, TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). Medline doi:10.1186/gb-2013-14-4-r36
82. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012). Medline doi:10.1038/nmeth.1923
83. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin; 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009). Medline doi:10.1093/bioinformatics/btp352
84. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). Medline doi:10.1038/nbt.1621
85. A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, B. Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008). Medline doi:10.1038/nmeth.1226
86. S. Shen, J. W. Park, J. Huang, K. A. Dittmar, Z. X. Lu, Q. Zhou, R. P. Carstens, Y. Xing, MATS: A Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Res. 40, e61 (2012). Medline doi:10.1093/nar/gkr1291
87. T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell, J. Parkhill, ACT: The Artemis Comparison Tool. Bioinformatics 21, 3422–3423 (2005). Medline doi:10.1093/bioinformatics/bti553
88. Y. Wang, H. Tang, J. D. Debarry, X. Tan, J. Li, X. Wang, T. H. Lee, H. Jin, B. Marler, H. Guo, J. C. Kissinger, A. H. Paterson, MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012). Medline doi:10.1093/nar/gkr1293
89. M. Krzywinski, J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, M. A. Marra, Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009). Medline doi:10.1101/gr.092759.109
90. E. R. Sears, The aneuploids of common wheat. Mo. Agric. Exp. Sta. Res. Bull. 572, 1–59 (1954).
91. E. R. Sears, L. Sears, in Proc. 5th Int. Wheat Genetics Symp., S. Ramanujam, Ed. (Indian Agricultural Research Institute, New Delhi, India, 1978), pp. 389–407.
92. T. R. Endo, B. S. Gill, The deletion stocks of common wheat. J. Hered. 87, 295–307 (1996). doi:10.1093/oxfordjournals.jhered.a023003
93. A. Graner, H. Siedler, A. Jahoor, R. G. Herrmann, G. Wenzel, Assessment of the degree and the type of restriction fragment length polymorphism in barley (Hordeum vulgare). Theor. Appl. Genet. 80, 826–832 (1990). Medline doi:10.1007/BF00224200
94. F. Balfourier, V. Roussel, P. Strelchenko, F. Exbrayat-Vinson, P. Sourdille, G. Boutet, J. Koenig, C. Ravel, O. Mitrofanova, M. Beckert, G. Charmet, A worldwide bread wheat core collection arrayed in a 384-well plate. Theor. Appl. Genet. 114, 1265–1275 (2007). Medline doi:10.1007/s00122-007-0517-1
95. A. Horvath, A. Didier, J. Koenig, F. Exbrayat, G. Charmet, F. Balfourier, Analysis of diversity and linkage disequilibrium along chromosome 3B of bread wheat (Triticum aestivum L.). Theor. Appl. Genet. 119, 1523–1537 (2009). Medline doi:10.1007/s00122-009-1153-8
96. S. de Givry, M. Bouchez, P. Chabrier, D. Milan, T. Schiex, CARHTA GENE: Multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics 21, 1703–1704 (2005). Medline doi:10.1093/bioinformatics/bti222
97. K. C. Cone, M. D. McMullen, I. V. Bi, G. L. Davis, Y. S. Yim, J. M. Gardiner, M. L. Polacco, H. Sanchez-Villeda, Z. Fang, S. G. Schroeder, S. A. Havermann, J. E. Bowers, A. H. Paterson, C. A. Soderlund, F. W. Engler, R. A. Wing, E. H. Coe Jr., Genetic, physical, and informatics resources for maize. On the road to an integrated map. Plant Physiol. 130, 1598–1605 (2002). Medline doi:10.1104/pp.012245
98. P. Wenzl, P. Suchánková, J. Carling, H. Simková, E. Huttner, M. Kubaláková, P. Sourdille, E. Paul, C. Feuillet, A. Kilian, J. Dolezel, Isolated chromosomes as a new and efficient source of DArT markers for the saturation of genetic maps. Theor. Appl. Genet. 121, 465–474 (2010). Medline doi:10.1007/s00122-010-1323-8
99. M. E. Sorrells, J. P. Gustafson, D. Somers, S. Chao, D. Benscher, G. Guedira-Brown, E. Huttner, A. Kilian, P. E. McGuire, K. Ross, J. Tanaka, P. Wenzl, K. Williams, C. O. Qualset, A. Van Deynze, Reconstruction of the synthetic W7984 x Opata M85 wheat reference population. Genome 54, 875–882 (2011). Medline doi:10.1139/g11-054
100. P. Sourdille, T. Cadalen, H. Guyomarc’h, J. W. Snape, M. R. Perretant, G. Charmet, C. Boeuf, S. Bernard, M. Bernard, An update of the Courtot x Chinese Spring intervarietal molecular marker linkage map for the QTL detection of agronomic traits in wheat. Theor. Appl. Genet. 106, 530–538 (2003). Medline
101. T. W. Banks, M. C. Jordan, D. J. Somers, Single-Feature Polymorphism Mapping in Bread Wheat. Plant Gen. 2, 167 (2009). doi:10.3835/plantgenome2009.02.0009
102. P. A. Wilkinson, M. O. Winfield, G. L. Barker, A. M. Allen, A. Burridge, J. A. Coghill, K. J. Edwards, CerealsDB 2.0: An integrated resource for plant breeders and scientists. BMC Bioinformatics 13, 219 (2012). Medline doi:10.1186/1471-2105-13-219
103. A. M. Allen, G. L. Barker, P. Wilkinson, A. Burridge, M. Winfield, J. Coghill, C. Uauy, S. Griffiths, P. Jack, S. Berry, P. Werner, J. P. Melichar, J. McDougall, R. Gwilliam, P. Robinson, K. J. Edwards, Discovery and development of exome-based, co-dominant single nucleotide polymorphism markers in hexaploid wheat (Triticum aestivum L.). Plant Biotechnol. J. 11, 279–295 (2013). Medline doi:10.1111/pbi.12009
104. M. O. Winfield, P. A. Wilkinson, A. M. Allen, G. L. Barker, J. A. Coghill, A. Burridge, A. Hall, R. C. Brenchley, R. D’Amore, N. Hall, M. W. Bevan, T. Richmond, D. J. Gerhardt, J. A. Jeddeloh, K. J. Edwards, Targeted re-sequencing of the allohexaploid wheat exome. Plant Biotechnol. J. 10, 733–742 (2012). Medline doi:10.1111/j.1467-7652.2012.00713.x
105. P. J. Bradbury, Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, E. S. Buckler, TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635 (2007). Medline doi:10.1093/bioinformatics/btm308
106. O. Sosnowski, A. Charcosset, J. Joets, BioMercator V3: An upgrade of genetic map compilation and quantitative trait loci meta-analysis algorithms. Bioinformatics 28, 2082–2083 (2012). Medline doi:10.1093/bioinformatics/bts313
107. J. C. Zadoks, T. T. Chang, C. F. Konzak, A decimal code for the growth stages of cereals. Weed Res. 14, 415–421 (1974). doi:10.1111/j.1365-3180.1974.tb01084.x
108. C. Groos, N. Robert, E. Bervas, G. Charmet, Genetic analysis of grain protein-content, grain yield and thousand-kernel weight in bread wheat. Theor. Appl. Genet. 106, 1032–1040 (2003). Medline
109. D. An, J. Su, Q. Liu, Y. Zhu, Y. Tong, J. Li, R. Jing, B. Li, Z. Li, Mapping QTLs for nitrogen uptake in relation to the early growth of wheat, Triticum aestivum L. Plant Soil 284, 73–84 (2006). doi:10.1007/s11104-006-0030-3
110. D. Z. Habash, S. Bernard, J. Schondelmaier, J. Weyen, S. A. Quarrie, The genetics of nitrogen use in hexaploid wheat: N utilisation, development and yield. Theor. Appl. Genet. 114, 403–419 (2007). Medline doi:10.1007/s00122-006-0429-5
111. A. Laperche, M. Brancourt-Hulmel, E. Heumez, O. Gardet, E. Hanocq, F. Devienne-Barret, J. Le Gouis, Using genotype x nitrogen interaction variables to evaluate the QTL involved in wheat tolerance to nitrogen constraints. Theor. Appl. Genet. 115, 399–415 (2007). Medline doi:10.1007/s00122-007-0575-4
112. Z. Li et al., Molecular mapping of QTLs for root response to phosphorus deficiency at seedling stage in wheat (Triticum aestivum L.). Prog. Nat. Sci. 17, 1177 (2007).
113. J.-X. Fontaine, C. Ravel, K. Pageau, E. Heumez, F. Dubois, B. Hirel, J. Le Gouis, A quantitative genetic study for elucidating the contribution of glutamine synthetase, glutamate dehydrogenase and other nitrogen-related physiological traits to the agronomic performance of common wheat. Theor. Appl. Genet. 119, 645–662 (2009). Medline doi:10.1007/s00122-009-1076-4
114. Y. Zhang, J. Tang, Y. Zhang, J. Yan, Y. Xiao, Y. Zhang, X. Xia, Z. He, QTL mapping for quantities of protein fractions in bread wheat (Triticum aestivum L.). Theor. Appl. Genet. 122, 971–987 (2011). Medline doi:10.1007/s00122-010-1503-6
115. D. Bennett, A. Izanloo, M. Reynolds, H. Kuchel, P. Langridge, T. Schnurbusch, Genetic dissection of grain yield and physical grain quality in bread wheat (Triticum aestivum L.) under water-limited environments. Theor. Appl. Genet. 125, 255–271 (2012). Medline doi:10.1007/s00122-012-1831-9
116. M. Bogard, M. Jourdan, V. Allard, P. Martre, M. R. Perretant, C. Ravel, E. Heumez, S. Orford, J. Snape, S. Griffiths, O. Gaju, J. Foulkes, J. Le Gouis, Anthesis date mainly explained correlations between post-anthesis leaf senescence, grain yield, and grain protein concentration in a winter wheat population segregating for flowering time QTLs. J. Exp. Bot. 62, 3621–3636 (2011). Medline doi:10.1093/jxb/err061
117. Y. Guo, F. M. Kong, Y. F. Xu, Y. Zhao, X. Liang, Y. Y. Wang, D. G. An, S. S. Li, QTL mapping for seedling traits in wheat grown under varying concentrations of N, P and K nutrients. Theor. Appl. Genet. 124, 851–865 (2012). Medline doi:10.1007/s00122-011-1749-7
118. M. Bogard, V. Allard, P. Martre, E. Heumez, J. W. Snape, S. Orford, S. Griffiths, O. Gaju, J. Foulkes, J. Le Gouis, Identifying wheat genomic regions for improving grain protein concentration independently of grain yield using multiple inter-related populations. Mol. Breed. 31, 587–599 (2013). doi:10.1007/s11032-012-9817-5
119. X. Liu, R. Li, X. Chang, R. Jing, Mapping QTLs for seedling root traits in a doubled haploid wheat population under different water regimes. Euphytica 189, 51–66 (2013). doi:10.1007/s10681-012-0690-4
120. S. Griffiths, J. Simmonds, M. Leverington, Y. Wang, L. Fish, L. Sayers, L. Alibert, S. Orford, L. Wingen, L. Herry, S. Faure, D. Laurie, L. Bilham, J. Snape, Meta-QTL analysis of the genetic control of ear emergence in elite European winter wheat germplasm. Theor. Appl. Genet. 119, 383–395 (2009). Medline doi:10.1007/s00122-009-1046-x
121. S.-L. Mao, Y.-M. Wei, W. Cao, X.-J. Lan, M. Yu, Z.-M. Chen, G.-Y. Chen, Y.-L. Zheng, Confirmation of the relationship between plant height and Fusarium head blight resistance in wheat (Triticum aestivum L.) by QTL meta-analysis. Euphytica 174, 343–356 (2010). doi:10.1007/s10681-010-0128-9
122. L.-Y. Zhang, D. C. Liu, X. L. Guo, W. L. Yang, J. Z. Sun, D. W. Wang, A. Zhang, Genomic distribution of quantitative trait loci for yield and yield-related traits in common wheat. J. Integr. Plant Biol. 52, 996–1007 (2010). Medline doi:10.1111/j.1744-7909.2010.00967.x
123. U. M. Quraishi, M. Abrouk, F. Murat, C. Pont, S. Foucrier, G. Desmaizieres, C. Confolent, N. Rivière, G. Charmet, E. Paux, A. Murigneux, L. Guerreiro, S. Lafarge, J. Le Gouis, C. Feuillet, J. Salse, Cross-genome map based dissection of a nitrogen use efficiency ortho-metaQTL in bread wheat unravels concerted cereal genome evolution. Plant J. 65, 745–756 (2011). Medline doi:10.1111/j.1365-313X.2010.04461.x
124. S. Griffiths, J. Simmonds, M. Leverington, Y. Wang, L. Fish, L. Sayers, L. Alibert, S. Orford, L. Wingen, J. Snape, Meta-QTL analysis of the genetic control of crop height in elite European winter wheat germplasm. Mol. Breed. 29, 159–171 (2012). doi:10.1007/s11032-010-9534-x
125. T. Wicker, F. Sabot, A. Hua-Van, J. L. Bennetzen, P. Capy, B. Chalhoub, A. Flavell, P. Leroy, M. Morgante, O. Panaud, E. Paux, P. SanMiguel, A. H. Schulman, A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007). Medline doi:10.1038/nrg2165