The banana (Musa acuminata) genome sequence
Patrick Wincker
Angélique D’Hont Cirad
Taxonomy
Banana is a Giant Herb
- Monocotyledon - Commelinids lineage - Zingiberales order - Musaceae family
adapted from Chase, 2004
POALES: rice, sorghum, brachypodium, maize
ZINGIBERALES: banana
ARECALES: oil palm
ASPARAGALES
LILIALES
PANDANALES
DIOSCOREALES
PETROSAVIALES
ALISMATALES
ACORALES
Commelinids
Musa balbisiana
Radiation of wild Musa/Domestication
Domestication involved:
- hybridization between species and subspecies made possible by human migration
- selection of diploid and triploid, seedless, parthenocarpic hybrids by early farmers
Musa acuminata: subspecies with distinct chromosome structures
- truncata
- errans
- zebrina
- malaccensis
- burmanica/siamea
- banksii
- microcarpa
Central for food security in many (sub)-tropical countries
Hundreds of varieties exist
Production in million tonnes, by region and cultivar group

Region                               Plantain  Others  Mutika  Cavendish     Total   Share
                                     (AAB)     (AAB)   (AAA)   (AAA)
Central/South America & Caribbean    7.7       1.5             24.5          33.7
West Africa                          9.0       1.1             3.2           13.3
East Africa                          1.3               13.4    2.6           17.3
North Africa & Middle East                                     1.7           1.7
India & Sri Lanka                    0.6       2.2             9.4           12.2
Asia & Oceania                       0.2       8.4             16.5          25.1
Total World                          18.8      13.2    13.4    57.9 (>50%)   103.3   100%
Local consumption                    18.8      12.2    13.4    45.2          89.6    87%
Exportation                                                    13.8          13.8    13%
Cavendish: somaclones from one zygote, but > 50% of world production
Monoculture of Cavendish somaclones -> Highly vulnerable -> Devastating diseases -> Extensive use of pesticides
Breeding very complex
Cultivated bananas: high level of sterility, producing seedless fruits
- structural heterozygosity
- mainly triploid
- vegetative propagation
-> Urgent need for improved cultivars
-> Production of a reference whole genome sequence of banana
Pahang Doubled-Haploid
from anther culture
Genome size : 520 Mb
Pahang
Wild diploid (2n=22)
Species: M. acuminata
Subspecies: malaccensis
Pahang doubled haploid
Bakry et al Fruits 2008
-> Haploids available only for one M. acuminata genotype : Pahang
Principles of Musa genome assembly
- 27,495,411 reads (400 bp) -> assembled into 24,425 contigs
- + paired-end sequences (10 kb, 100 kb; 2,139,909 reads) -> assembled into 7,513 scaffolds
- + genetic map -> anchoring on the genetic map -> 11 pseudo-molecules
Musa assembly
Technology    Reads         Coverage
454 single    27 495 411    16.9 X
Sanger        2 139 909     3.9 X
Total                       20.8 X
Assembly using Newbler, then consensus correction using 50x Illumina GAIIx data
Sequence assembly = 472 Mb = 91% of the DH-Pahang genome
Musa assembly (DH-Pahang genome size: 520 Mb)
             Number    N50 size (number)    Size (cumulated)
Contigs      24,425    43.1 kb              390 Mb
Scaffolds    7,513     1.3 Mb (65)          472 Mb
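The N50 statistic can be read as follows: sort the sequences from longest to shortest and add up their lengths until half of the total is reached; the length of the last sequence added is the N50. The sketch below is a minimal Python illustration, assuming that the number in parentheses on the slide (65 for scaffolds) is the count of sequences needed to reach that half. The function name `n50` and the toy lengths are hypothetical and are not taken from the Musa assembly.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach half the total).

    `lengths` is any non-empty collection of contig or scaffold lengths.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)   # longest first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


if __name__ == "__main__":
    # Toy lengths (hypothetical), total = 300, half = 150.
    toy = [80, 70, 50, 40, 30, 20, 10]
    print(n50(toy))

    # Assembled fraction from the figures quoted on the slides.
    print(round(472 / 520 * 100))  # percentage of the 520 Mb genome
```

With real scaffold lengths as input, the same function would return the pair reported in the "N50 size (number)" column, under the reading assumed above.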
Development of the mapping population
Progeny issued from Musa acuminata ‘Pahang’ self-fertilisation
Pahang fruit -> embryo rescue from seeds -> Pahang progeny
• creation: self-fertilisation (C. Jenny)
• embryo rescue (R. Habas, S. Joseph, F. Bakry): 4 hands, 820 seeds, 618 embryos
• growth (R. Habas, F. Carreel): 441 hybrids
• ploidy analyses (S. Joseph, F. Bakry, M. Rodier, F. Carreel) through flow cytometry: all diploids except two 4x
• DNA extraction (M. Souquet, C. Cardi, F. Carreel): 441 hybrids
180 individuals used to build the genetic map for the sequence alignment
UMR AGAP/APMV, UMR AGAP/SEG, UMR BGPI/BECj
Development and genotyping of molecular markers
SSR (F.C. Baurens, M. Souquet, C. Cardi, R. Rivallan, A.M. Risterucci, F. Carreel)
- Total tested: 2 454 markers
  - from previous studies: 16% (386)
  - developed within the project (F.C. Baurens), from BES + scaffolds: 84% (2 068)
- 48% (1 185) polymorphic
- 589 SSR used for the anchorage (max 3 per scaffold)
DArT (A. Kilian, F.C. Baurens, F. Carreel)
- Total tested: an array of 15 360 markers
  - from previous studies: 50%
  - new probes, enriched in Pahang: 50%
- 1 008 polymorphic, of which 534 unique
- 63 DArT used for the anchorage (for scaffolds with < 3 SSR)
(Aus), UMR AGAP/GPTR Génotypage and UMR AGAP/SEG
Anchoring of the 11 chromosomes
• 647 markers used for the anchoring
• 258 scaffolds, including 98.0% of the scaffolds larger than 1 Mb
[Figure: genetic map and physical map (scaffolds) for Chr1 to Chr11; scale bars: 1 cM, 10 Mb]

            Assembly        DH genome    Genes
Total       472 Mb          523 Mb       36 000
Anchored    70% (332 Mb)    64%          92%
Oriented    47% (221 Mb)    42%          84%
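The gap between anchored and oriented sequence is what one would expect from map-based anchoring in general: a single mapped marker is enough to place a scaffold on a chromosome, but orientation needs at least two markers at distinct genetic positions. The following is a minimal sketch of that logic, not the procedure actually used for the Musa assembly. The function `anchor_scaffolds`, the tuple layout and the toy markers are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def anchor_scaffolds(markers):
    """Place scaffolds on linkage groups from mapped markers.

    markers: iterable of (scaffold, position_bp, linkage_group, position_cM).
    Returns a list of (linkage_group, scaffold, orientation), ordered along
    each linkage group. Orientation is "+", "-" or None (not orientable).
    """
    by_scaffold = defaultdict(list)
    for scaffold, bp, group, cm in markers:
        by_scaffold[scaffold].append((bp, group, cm))

    placed = []
    for scaffold, hits in by_scaffold.items():
        # Majority vote decides which linkage group the scaffold belongs to.
        group = Counter(g for _, g, _ in hits).most_common(1)[0][0]
        kept = sorted((bp, cm) for bp, g, cm in hits if g == group)
        first_cm, last_cm = kept[0][1], kept[-1][1]
        if last_cm > first_cm:
            orientation = "+"
        elif last_cm < first_cm:
            orientation = "-"
        else:
            orientation = None  # single marker, or no recombination between markers
        placed.append((group, mean(cm for _, cm in kept), scaffold, orientation))

    placed.sort(key=lambda row: (row[0], row[1]))
    return [(group, scaffold, orient) for group, _, scaffold, orient in placed]


if __name__ == "__main__":
    toy_markers = [  # hypothetical data
        ("scafA", 100_000, "Chr1", 12.0),
        ("scafA", 900_000, "Chr1", 15.5),
        ("scafB", 50_000, "Chr1", 4.0),
        ("scafB", 700_000, "Chr1", 2.5),
        ("scafC", 10_000, "Chr2", 30.0),
    ]
    for row in anchor_scaffolds(toy_markers):
        print(row)
```

In the toy data, `scafC` carries a single marker, so it is anchored but left unoriented.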
Gene annotation process
- Masking of repeats (transposable elements, ...): Musa TE (from 64 BAC and 454 sequences), plant TE (Repbase), ab initio (RepeatScout)
- cDNAs: Musa new cDNA, 829,587 reads; Musa RNA-Seq, 143,682,857 reads
  (tissues: flowers, bracts, old leaves, cigar leaves, fruits)
- ESTs: Musa, 91 041 (GMGC); monocots, 6 888 879
- Known peptides: plant peptides (UniProtKB)
- ab initio predictions: Geneid, SNAP, FGENESH (training set of 400 manually annotated genes)
-> Reconciliation (GAZE) -> Gene models
Musa acuminata predicted proteome
Number of protein coding genes     36,542
Median number of exons per gene    4
Median size of CDS                 861 bp
Median size of intron              147 bp
Avg. % GC of CDS                   50.2%

Genome composition
Protein coding genes    30% (CDS 9%, intron 20%, UTR 1%)
Intergenic sequences    70%
Transposable elements (TE) annotation
STEP1: Construction of a Pahang TE reference library
- LTR retrotransposons (Copia, Gypsy) (E. Hribova, IEB, Czech Republic + O. Garsmeur)
- Non-LTR retrotransposons (LINEs) (P. Heslop-Harrison, Univ Leicester, UK)
- DNA transposons (sub-classes 1) (T. Wicker, Univ Zurich, Switzerland)
STEP2: Screening of the Musa assembly with the REPET package (TEannot pipeline)
Transposable element distribution
- Transposable elements annotated in 44% of the genome assembly
- DNA transposons account for only 1.3%, mainly hAT; no CACTA or Mariner elements, which have so far been found in high copy numbers in all angiosperms
[Figure legend: genetic map, gene intron, gene exon]
Genome landscape
[Figure: genome landscape along the chromosomes (scale 0 to 40 Mb); 45S and 5S loci marked]
-> Sharp transition between gene-rich and TE-rich regions
Annotation of repeats in the non-assembled part of the genome
Element    Type               Size (bp)    Sequence nb    Copy nb    Mb
its 26S                       3806         173294         979        3.7
its 18S                       4179         263846         1358       5.7
its 5S                        521          56534          2334       1.2
Daterra    LTR-retro Copia    7324         336947         989        7.2
Nanica     LINE               5291         25270          103        0.5
Maca       LTR-retro Copia    1842         6771           79         0.1
Caturra    LTR-retro Gypsy    5405         66916          266        1.4
Total: 1832094
Location by in-situ hybridization
[Figure: FISH panels on chromosomes 1 to 11; panel labels: centromeric (Nanica), peri-centromeric, peri-centromeric, all over chromosomes]
FISH: Maguy Rodier + Dheema Burthia (AREU, Mauritius)
O. Panaud (Univ Perpignan, France)+ J. Barbosa (Univ Rio Grande, Brazil)
Typical short tandem centromeric repeats were not found in Musa.
Viral sequences related to Banana streak virus (BSV) integrated in Pahang genome
- highly reorganized and fragmented
- spread all over the genome
- belong to a badnavirus phylogenetic group different from the endogenous BSV species (eBSV) found in M. balbisiana
- most of them formed a new subgroup/species
Viral sequences integrated in the M. acuminata Pahang genome
[Figure: badnavirus phylogenetic tree; labelled groups: eBSV from M. balbisiana, new species]
- seem defective and unable to release free infectious viral particles
[Figure: histogram of gene pairs (y-axis, up to 1600) by Ks value (x-axis)]
Complex pattern of small duplicated chromosome segments; Ks varies strongly between clusters, suggesting several WGD events
Whole genome duplications (WGD)
Paralogous relationships between the eleven Musa chromosomes
[Figure: dot plot of Musa chromosomes 1 to 11 against themselves]
COGE / Synmap ; http://synteny.cnr.berkeley.edu/CoGe/SynMap.pl
Whole genome duplications (WGD)
Each duplicated segment mainly involves 4 syntenic regions
At least 2 WGD in the Musa lineage : α and β
Whole genome duplications in Musa
Paralogous clusters corresponding to the lowest Ks values were assembled into 12 ancestral blocks, which represent the Musa ancestral genome before the α+β WGDs.
[Figure: ancestral blocks 1 to 12 located on Chr1 to Chr11 (scale 0 to 40 Mb); detail panels for Block 2 and Block 3 across chr1 to chr11]
Whole genome duplications in Musa
Additional paralogous relationships between the 12 Musa ancestral blocks, with higher Ks values, suggest another, older duplication
Two WGDs : α and β
Older WGD : γ
Three WGD events identified in Musa: two close events (α and β) and one older event (γ)
[Figure: distribution of Ks values; x-axis: Ks (0 to 3); y-axis: percentage of gene pairs (0 to 30); series: segment of Block 1, segment of Block 3; annotation: high Ks, older WGD]
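The reasoning behind these Ks plots is that gene pairs created by the same duplication event share a similar Ks, so each WGD shows up as a peak, with older events at higher Ks. The sketch below illustrates the idea with a simple histogram and a naive local-maximum search. It is only an illustration under stated assumptions: the function names, the bin width, the threshold and the toy Ks values are hypothetical, and real analyses typically use more careful curve fitting.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=3.0):
    """Return the percentage of gene pairs falling in each Ks bin."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]


def local_maxima(percentages, min_percent=5.0):
    """Return indices of bins that are local peaks above a threshold."""
    peaks = []
    for i in range(1, len(percentages) - 1):
        if (percentages[i] >= min_percent
                and percentages[i] > percentages[i - 1]
                and percentages[i] >= percentages[i + 1]):
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    # Toy Ks values (hypothetical): a younger cluster and an older cluster.
    toy_ks = [0.42, 0.45, 0.47, 0.48, 0.55, 0.95, 1.91, 1.94, 1.96, 2.45]
    hist = ks_histogram(toy_ks)
    for i in local_maxima(hist, min_percent=20.0):
        centre = (i + 0.5) * 0.1
        print(f"peak near Ks = {centre:.2f} ({hist[i]:.0f}% of pairs)")
```

Two well-separated peaks in such a histogram would be read as two duplication events of different ages; close events such as α and β may not separate cleanly with this naive approach.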
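[Figure: dated tree (time axis 125 to 0; Jurassic, Cretaceous, Tertiary) of Oryza sativa, Brachypodium distachyon and Sorghum bicolor (Poales) and of Zingiber officinale and Musa acuminata (Zingiberales); WGD events marked: Rho and Sigma (Poales), α, β and γ (Musa); one position marked "?"]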
Timing of WGD events relative to speciation events
Two WGD events (Rho, Sigma) identified in the Poales lineage:
- the timing of Rho WGD is well established around 70 Mya
- the timing of Sigma (Tang et al 2010) relative to divergence of Poales and Zingiberales was less clear
Comparative genome analysis to examine whether the Musa γ WGD corresponds to Sigma WGD of the Poales
Comparison of Musa and rice contemporaneous genomes
[Figure: dot plot of Musa chromosomes 1 to 11 against Oryza sativa chromosomes 1 to 12]
Syntenic clusters of genes between Musa and rice
Scarce synteny conservation, pattern compatible with several independent WGD events in the rice and Musa lineages, followed by gene loss (fractionation) and chromosomal rearrangements
Comparison of Musa and rice ancestral genomes
[Figure: schematic of the Zingiberales/Poales divergence; banana side: Alpha, Beta and Gamma WGDs and banana ancestral blocks; Poales side: Sigma and Rho WGDs, Sigma blocks and Rho blocks]
Ancestral blocks approximate the genome composition before WGD and thus account for post-WGD gene losses, which increases the sensitivity of the analysis
Gamma and Sigma WGDs occurred after the Poales/Zingiberales divergence
[Figure: comparison of blocks Rho5, Rho2 and Sigma6 with Musa Block 8 and Musa Block 2]
Rho and Sigma blocks from Tang et al (2011)
Timing of WGD events relative to speciation events
Phylogenomic analysis of 93 single copy genes
Suggests that Zingiberales are more closely related to Arecales than to Poales
[Figure: dated tree (time axis 125 to 0; Jurassic, Cretaceous, Tertiary). Monocots: Oryza, Brachypodium, Sorghum (Poales); Zingiber, Musa (Zingiberales); Phoenix (Arecales); Asparagus; Acorus. Eudicots: Vitis, Medicago, Populus, Carica, Arabidopsis]
Timing of WGD events relative to speciation events
The three Musa WGDs occurred after the divergence of Zingiberales from Arecales and Poales
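[Figure: dated tree (time axis 125 to 0; Jurassic, Cretaceous, Tertiary) of monocots (Poales: Oryza, Brachypodium, Sorghum; Zingiberales: Zingiber, Musa; Arecales: Phoenix; Asparagus; Acorus) and eudicots (Vitis, Medicago, Populus, Carica, Arabidopsis), with WGD events placed on the branches (labels: σ, ρ; γ, β, α; γ; α, β; one event marked "?")]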
Phylogenomic analyses performed on 3,553 gene families
Over-retention of Musa transcription factors after WGD
-> Amplification of several banana transcription factor gene families
[Figure legend: gene loss, WGD]
Collaboration Mathieu Rouard (Bioversity)
Distribution of gene “families” among banana, three Poaceae, date-palm and Arabidopsis
Suggest a high level of divergence and diversification within the Poaceae lineage
Enriched in genes coding for transcription factors, defense related proteins, cell wall metabolism and secondary metabolism enzymes
Conclusion
Crucial stepping-stone for genetic improvement of this under-researched vital crop
An essential bridge for genes and genomes evolution studies within Monocotyledons and with Dicotyledons
The reference whole genome sequence of banana
http://banana-genome.cirad.fr/
http://southgreen.cirad.fr/
A new template for banana genetics
✔ Unlimited (almost) source of DNA markers located on the chromosomes
- SSR: from several hundred to now several thousand located on the chromosomes
- SNP: a template for SNP discovery and mapping for genetic and diversity studies
✔ A template to characterize chromosome structural variations (insertion, deletion, inversion, duplication, translocation) and subsequent prospects for germplasm improvement
✔ Access for the first time to the entire set of Musa genes (36 542)
- Template for transcriptomic analysis
- Candidate gene strategies based on physiological studies and insights from other species; freeway to crosscutting evidence
Nabila Yahiaoui Franc-Christophe Baurens Françoise Carreel Olivier Garsmeur Stéphanie Bocs Gaetan Droc Céline Cardi Marlène Souquet Cyril Jourda Juliette Lengelle Marguerite Rodier Didier Mbéguié Matthieu Chabannes Rémy Habas Ronan Rivallan Philippe Francois Claire Poiron Christophe Jenny Frédéric Bakry Steeve Joseph Anne Dievart Julie Leclercq Xavier Argout Ange-Marie Risterucci Manuel Ruiz Jean Christophe Glaszmann
Patrick Wincker France Denoeud Jean-Marc Aury Benjamin Noel Corinne Da Silva Kamel Jabbari Julie Poulain Karine Labadie Adriana Alberti Maria Bernard Margot Correa Olivier Jaillon Jean Weissenbach
Mathieu Rouard (Bioversity, France) Valentin Guignon (Gene families analysis)
Thomas Wicker (Univ Zurich, Switzerland) Eva Hribova (IEB, Czech Republic) Jaroslav Dolezel (IEB, Czech Republic) Pat Heslop-Harrison (Univ Leicester, UK) Olivier Panaud (Univ Perpignan, France) José Barbosa (Univ Rio Grande, Brazil) Dheema Burthia (AREU, Mauritius) Mouna Jeridi (IRA, Tunisia) (Transposable element analysis +FISH)
Andrzej Kilian (DArT, Canberra, Australia) (DArT development and genotyping)
Nicolas Roux (Bioversity, France) Gert Kema (PRI, Wageningen) (provided MAMB BAC-end sequences)
Dutch Groene Woudt Foundation
Diane Burgess Mike Freeling (Univ Berkeley, USA) (CNS analysis)
Michael McKain Saravanaraj Ayyampalayam Jim Leebens-Mack (Univ Georgia, USA) (Phylogenomic analysis)
Eric Lyons (Univ Arizona, USA) (COGE tools)
Francis Quetier
Spencer Brown (ISV, Gif sur Yvette) (genome size)