The banana (Musa acuminata) genome sequence Patrick Wincker Angélique D’Hont Cirad

The banana (Musa acuminata) genome sequence · 2016. 3. 7.


  • The banana (Musa acuminata) genome sequence

    Patrick Wincker

    Angélique D’Hont Cirad

  • Taxonomy

    Banana is a Giant Herb

    - Monocotyledon
    - Commelinids lineage
    - Zingiberales order
    - Musaceae family

    adapted from Chase, 2004

    POALES: rice, sorghum, brachypodium, maize

    ZINGIBERALES: banana

    ARECALES: oil palm

    ASPARAGALES

    LILIALES

    PANDANALES

    DIOSCOREALES

    PETROSAVIALES

    ALISMATALES

    ACORALES

    Commelinids

  • Musa balbisiana

    Radiation of wild Musa/Domestication

    Domestication involved:

    - hybridization between species and subspecies made possible by human migration

    - selection of diploid and triploid, seedless, parthenocarpic hybrids by early farmers

    Musa acuminata subspecies with distinct chromosome structures: truncata, errans, zebrina, malaccensis, burmanica/siamea, banksii, microcarpa

  • Central for food security in many (sub)-tropical countries

    Hundreds of varieties exist

    Million tonnes                      AAB        AAB        AAA        AAA
                                        Plantain   Others     Mutika     Cavendish     Total      Share
    Central/South America & Caribbean   7.7        1.5        -          24.5          33.7
    West Africa                         9.0        1.1        -          3.2           13.3
    East Africa                         1.3        -          13.4       2.6           17.3
    North Africa & Middle East          -          -          -          1.7           1.7
    India & Sri Lanka                   0.6        2.2        -          9.4           12.2
    Asia & Oceania                      0.2        8.4        -          16.5          25.1
    Total World                         18.8       13.2       13.4       57.9 (>50%)   103.3      100%
    Local consumption                   18.8       12.2       13.4       45.2          89.6       87%
    Exportation                         -          -          -          13.8          13.8       13%

    Cavendish: somaclones from one zygote, but > 50% of world production
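The "> 50%" figure can be rechecked from the production table above. The sketch below is only an arithmetic check: it assumes each regional value belongs to the cultivar group that makes the row and column totals on the slide add up, and the names `production` and `totals_by_group` are illustrative.

```python
# Production in million tonnes, read from the table above.
# Assumption: values are assigned to cultivar groups so that the row and
# column sums reproduce the totals printed on the slide.
production = {
    "Central/South America & Caribbean": {"Plantain": 7.7, "Others": 1.5, "Cavendish": 24.5},
    "West Africa": {"Plantain": 9.0, "Others": 1.1, "Cavendish": 3.2},
    "East Africa": {"Plantain": 1.3, "Mutika": 13.4, "Cavendish": 2.6},
    "North Africa & Middle East": {"Cavendish": 1.7},
    "India & Sri Lanka": {"Plantain": 0.6, "Others": 2.2, "Cavendish": 9.4},
    "Asia & Oceania": {"Plantain": 0.2, "Others": 8.4, "Cavendish": 16.5},
}


def totals_by_group(table):
    """Sum production per cultivar group over all regions (rounded to 0.1 Mt)."""
    totals = {}
    for groups in table.values():
        for group, tonnes in groups.items():
            totals[group] = totals.get(group, 0.0) + tonnes
    return {group: round(total, 1) for group, total in totals.items()}


group_totals = totals_by_group(production)
world_total = round(sum(group_totals.values()), 1)
cavendish_share = 100 * group_totals["Cavendish"] / world_total

print(group_totals)
print(world_total)
print(f"Cavendish share: {cavendish_share:.0f}%")
```

With these values the Cavendish column sums to 57.9 out of 103.3 million tonnes, which is where the "> 50%" on the slide comes from.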

  • Monoculture of Cavendish somaclones -> Highly vulnerable -> Devastating diseases -> Extensive use of pesticides

    Breeding is very complex: cultivated bananas have a high level of sterility and produce seedless fruits
    - structural heterozygosity
    - mainly triploid
    - vegetative propagation

    -> Urgent need for improved cultivars

    -> Production of a reference whole genome sequence of banana

  • Pahang Doubled-Haploid

    from anther culture

    Genome size : 520 Mb

    Pahang

    Wild diploid (2n=22)

    Species: M. acuminata

    Subspecies: malaccensis

    Pahang doubled haploid

    Bakry et al Fruits 2008

    -> Haploids available only for one M. acuminata genotype : Pahang

  • Principles of Musa genome assembly

    27,495,411 reads (400 bp) -> assembled -> 24,425 contigs

    + Paired-end sequences (10 kb, 100 kb; 2,139,909 reads) -> assembled -> 7,513 scaffolds

    + Genetic map -> anchoring on genetic map -> 11 pseudo-molecules

  • Musa assembly

    Technology    Reads         Coverage
    454 single    27 495 411    16.9 X
    Sanger        2 139 909     3.9 X
    Total                       20.8 X

    Assembly using Newbler, then consensus correction using 50x Illumina GAIIx data

    Sequence assembly = 472 Mb = 91% of the DH-Pahang genome

    Musa assembly

    DH-Pahang genome size : 520 Mb

                 Number    N50 size (number)    Size (cumulated)
    Contigs      24,425    43.1 kb              390 Mb
    Scaffolds    7,513     1.3 Mb (65)          472 Mb
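The summary statistics on this slide can be illustrated with a short sketch. It shows one common way to compute an N50 (the length at which the sorted sequences reach half of the total size, together with the number of sequences needed) and rechecks the coverage sum and the 91% figure. The function name `n50` and the toy lengths are hypothetical; the real scaffold lengths are not given on the slide.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach half the total size)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("n50() needs at least one sequence length")


# Hypothetical toy lengths, not the Musa scaffolds.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))

# Figures quoted on the slide.
total_coverage = round(16.9 + 3.9, 1)        # 454 + Sanger coverage
assembled_percent = round(100 * 472 / 520)   # assembly size / DH-Pahang genome size
print(total_coverage, assembled_percent)
```

Read this way, "1.3 Mb (65)" in the table would mean that the 65 largest scaffolds already cover half of the assembly; that reading of the bracketed number is an assumption.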

  • Development of the mapping population

    [Figure panels: Pahang fruit; embryo rescue from seeds; Pahang progeny]

    Progeny issued from Musa acuminata ‘Pahang’ self-fertilisation

    - creation: self-fertilisation (C. Jenny)
    - embryo rescue (R. Habas, S. Joseph, F. Bakry): 4 hands, 820 seeds, 618 embryos
    - growth (R. Habas, F. Carreel): 441 hybrids
    - ploidy analyses through flow cytometry (S. Joseph, F. Bakry, M. Rodier, F. Carreel): all diploids except two 4x
    - DNA extraction (M. Souquet, C. Cardi, F. Carreel): 441 hybrids

    180 individuals used to build the genetic map for the sequence alignment

    UMR AGAP/APMV, UMR AGAP/SEG, UMR BGPI/BECj

  • Development and genotyping of molecular markers

    SSR (F.C. Baurens, M. Souquet, C. Cardi, R. Rivallan, A.M. Risterucci, F. Carreel)
    - Total tested: 2 454 markers
    - From previous studies: 16% (386)
    - Developed within the project from BES + scaffolds (F.C. Baurens): 84% (2 068)
    - Polymorphic: 48% (1 185)
    - 589 SSR used for the anchoring (max. 3 per scaffold)

    DArT (A. Kilian, F.C. Baurens, F. Carreel)
    - Total tested: an array of 15 360 markers
    - From previous studies: 50%
    - New probes, enriched in Pahang: 50%
    - Polymorphic: 1 008, of which 534 unique
    - 63 DArT used for the anchoring (for scaffolds with < 3 SSR)

    (Aus), UMR AGAP/GPTR Génotypage and UMR AGAP/SEG

  • Anchoring of the 11 chromosomes

    - 647 markers used for the anchoring
    - 258 scaffolds, including 98.0% of the scaffolds larger than 1 Mb

    [Figure: genetic map and physical map (scaffolds) for Chr1 to Chr11; scale bars 1 cM and 10 Mb]

                Assembly        DH genome    Genes
    Total       472 Mb          523 Mb       36 000
    Anchored    70% (332 Mb)    64%          92%
    Oriented    47% (221 Mb)    42%          84%

  • Gene annotation process

    Evidence combined by reconciliation (GAZE) -> Gene models

    - cDNAs: Musa new cDNA, 829,587 reads; Musa RNA-Seq, 143 682 857 reads
    - Tissues: flowers, bracts, old leaves, cigar leaves, fruits
    - ESTs: Musa 91 041 (GMGC); Monocot 6 888 879
    - Known peptides: plant peptides (UniProtKB)
    - ab initio predictions: Geneid, SNAP, FGENESH (training set of 400 manually annotated genes)
    - Repeats (transposable elements, ...), used for masking: Musa TE (from 64 BAC and 454 seq), plant TE (Repbase), ab initio (RepeatScout)

  • Musa acuminata predicted proteome

    Number of predicted protein-coding genes: 36,542
    Median number of exons per gene: 4
    Median size of CDS: 861 bp
    Median size of intron: 147 bp
    Avg. % GC of CDS: 50.2%

    Genome composition:
    - Protein-coding genes: 30% (CDS 9%, intron 20%, UTR 1%)
    - Intergenic sequences: 70%

  • Transposable elements (TE) annotation

    STEP1: Construction of a Pahang TE reference library
    - LTR retrotransposons (Copia, Gypsy) (E. Hribova, IEB, Czech Republic + O. Garsmeur)
    - Non-LTR retrotransposons (LINEs) (P. Heslop-Harrison, Univ Leicester, UK)
    - DNA transposons (sub-classes 1) (T. Wicker, Univ Zurich, Switzerland)

  • Transposable element distribution

    Only 1.3% of DNA transposons, mainly hAT; no CACTA or Mariner elements, although these have so far been found in high copy numbers in all angiosperms

    44 % of transposable elements annotated in the genome assembly

    [Figure legend: genetic map; gene intron; gene exon]

    STEP2: Screening of the Musa assembly with REPET package (TEannot pipeline)

  • Genome landscape

    [Figure: genome landscape along the chromosomes; axis in Mb (0 to 40); 45S and 5S loci marked]

    -> Sharp transition between gene-rich and TE-rich regions

  • Annotation of repeats in the non-assembled part of the genome

    Element    Type               Size (bp)    Sequence nb    Copy nb    Mb
    its 26S                       3806         173294         979        3.7
    its 18S                       4179         263846         1358       5.7
    its 5S                        521          56534          2334       1.2
    Daterra    LTR-retro Copia    7324         336947         989        7.2
    Nanica     LINE               5291         25270          103        0.5
    Maca       LTR-retro Copia    1842         6771           79         0.1
    Caturra    LTR-retro Gypsy    5405         66916          266        1.4

    Total: 1832094

    [Figure: location by in-situ hybridization; labels: centromeric (Nanica), peri-centromeric, peri-centromeric, all over chromosomes]

    FISH: Maguy Rodier + Dheema Burthia (AREU, Mauritius)

    O. Panaud (Univ Perpignan, France)+ J. Barbosa (Univ Rio Grande, Brazil)

    Typical short tandem centromeric repeats were not found in Musa.
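The "Mb" column of the repeat table above is numerically consistent with element size multiplied by copy number. The slide does not state how the column was derived, so the sketch below is only a consistency check under that assumption; `repeats` and `total_mb` are illustrative names.

```python
# (size in bp, copy number, Mb as printed on the slide)
repeats = {
    "its 26S": (3806, 979, 3.7),
    "its 18S": (4179, 1358, 5.7),
    "its 5S": (521, 2334, 1.2),
    "Daterra": (7324, 989, 7.2),
    "Nanica": (5291, 103, 0.5),
    "Maca": (1842, 79, 0.1),
    "Caturra": (5405, 266, 1.4),
}


def total_mb(size_bp, copies):
    """Cumulated length in Mb, assuming Mb = element size x copy number."""
    return size_bp * copies / 1_000_000


for name, (size_bp, copies, slide_mb) in repeats.items():
    computed = round(total_mb(size_bp, copies), 1)
    print(f"{name}: computed {computed} Mb, slide {slide_mb} Mb")
```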

  • Viral sequences related to Banana streak virus (BSV) integrated in Pahang genome

    [Figure: integrated viral sequences located on chromosomes 1 to 11]

    - highly reorganized and fragmented

    - spread all over the genome

    - belong to a badnavirus phylogenetic group different from the endogenous BSV species (eBSV) found in M. balbisiana

    - most of them formed a new subgroup/species

    [Figure labels: eBSV from M. balbisiana; new species; viral sequences integrated in the M. acuminata Pahang genome]

    They seem defective and unable to release free infectious viral particles

  • Whole genome duplications (WGD)

    [Figure: distribution of Ks values for gene pairs; x-axis Ks (0.12 to 4.26), y-axis number of gene pairs (0 to 1600)]

    Complex pattern of small duplicated chromosome segments, important variation of Ks depending on clusters, suggesting several WGD events

    Paralogous relationships between the eleven Musa chromosomes

    [Figure: dot plot with Musa chromosomes 1 to 11 on both axes]

    COGE / Synmap ; http://synteny.cnr.berkeley.edu/CoGe/SynMap.pl

  • Whole genome duplications (WGD)

    [Figure: dot plot with Musa chromosomes 1 to 11 on both axes]

    Duplicated segments involve each mainly 4 syntenic regions

    At least 2 WGD in the Musa lineage : α and β

    COGE / Synmap ; http://synteny.cnr.berkeley.edu/CoGe/SynMap.pl

    Paralogous relationships between the eleven Musa chromosomes

  • Whole genome duplications in Musa

    Paralogous clusters corresponding to the lowest Ks values are assembled into 12 ancestral blocks, which represent the Musa ancestral genome before the α+β WGDs

    [Figure: the 12 ancestral blocks (Block 1, Block 2, Block 3, ...) mapped onto Chr1 to Chr11; scale 0 to 40 Mb]

  • Whole genome duplications in Musa

    Additional paralogous relationships between the 12 Musa ancestral blocks with higher Ks values suggesting another older duplication

    Two WGDs : α and β

    Older WGD : γ

    3 WGD events identified in Musa: 2 close events (α and β) + 1 older event (γ)

    [Figure: percentage of gene pairs (0 to 30) against Ks (0 to 3) for a segment of Block 1 and a segment of Block 3; the high-Ks peak marks the older WGD]
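The reasoning on these slides, where separate Ks peaks among paralogous gene pairs point to separate duplication events, can be sketched as follows. This is a minimal illustration and not the analysis behind the slides: the Ks values are made-up toy numbers, `ks_histogram` and `local_peaks` are hypothetical helpers, and real studies estimate Ks from aligned coding sequences and usually fit the distribution more carefully.

```python
def ks_histogram(ks_values, bin_width=0.25, max_ks=3.0):
    """Return [(bin_start, percentage of gene pairs)] for Ks values in [0, max_ks)."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    kept = 0
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
            kept += 1
    return [
        (i * bin_width, 100.0 * count / kept if kept else 0.0)
        for i, count in enumerate(counts)
    ]


def local_peaks(hist, min_pct=10.0):
    """Bin starts whose percentage exceeds both neighbours and min_pct."""
    peaks = []
    for i, (start, pct) in enumerate(hist):
        left = hist[i - 1][1] if i > 0 else 0.0
        right = hist[i + 1][1] if i + 1 < len(hist) else 0.0
        if pct >= min_pct and pct > left and pct > right:
            peaks.append(start)
    return peaks


# Toy Ks values (hypothetical): one cluster of recent pairs, one older cluster.
toy_ks = [0.52, 0.55, 0.6, 0.61, 0.66, 0.7, 0.4, 0.8,
          1.55, 1.6, 1.62, 1.7, 1.3, 1.9]

print(local_peaks(ks_histogram(toy_ks)))
```

In this toy data the low-Ks peak would stand for the recent duplications and the high-Ks peak for an older event, mirroring the α/β versus γ distinction made on the slide.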

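  • Timing of WGD events relative to speciation events

    [Figure: dated phylogeny (time axis 125 to 0; Jurassic, Cretaceous, Tertiary) of Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Zingiber officinale and Musa acuminata, with the Rho and Sigma WGDs in Poales, the α, β and γ WGDs in Zingiberales, and a "?" on the uncertain timing]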

    Two WGD events (Rho, Sigma) identified in the Poales lineage:

    - the timing of Rho WGD is well established around 70 Mya

    - the timing of Sigma (Tang et al 2010) relative to divergence of Poales and Zingiberales was less clear

    Comparative genome analysis to examine whether the Musa γ WGD corresponds to Sigma WGD of the Poales

  • Comparison of Musa and rice contemporaneous genomes

    [Figure: dot plot of syntenic clusters of genes between Musa (chromosomes 1 to 11) and Oryza sativa (chromosomes 1 to 12)]

    Scarce synteny conservation, pattern compatible with several independent WGD events in the rice and Musa lineages, followed by gene loss (fractionation) and chromosomal rearrangements

  • Comparison of Musa and rice ancestral genomes

    [Figure: Zingiberales/Poales divergence followed by the Gamma, Beta and Alpha WGDs in the Musa lineage (β banana blocks) and by the Sigma and Rho WGDs in the Poales lineage (Sigma blocks, Rho blocks); example comparison of Rho5, Rho2 and Sigma6 with Musa Block 8 and Musa Block 2]

    Ancestral blocks approximate the genome composition before WGD and thus account for post-WGD gene losses, increasing analysis sensitivity

    Gamma and Sigma WGDs occurred after the Poales/Zingiberales divergence

    Rho and Sigma blocks from Tang et al (2011)

  • Timing of WGD events relative to speciation events

    Phylogenomic analysis of 93 single copy genes

    Suggests that Zingiberales are more closely related to Arecales than to Poales

    [Figure: dated phylogeny (time axis 125 to 0; Jurassic, Cretaceous, Tertiary) of monocots (Oryza, Brachypodium, Sorghum in Poales; Zingiber, Musa in Zingiberales; Phoenix in Arecales; Asparagus; Acorus) and eudicots (Vitis, Medicago, Populus, Carica, Arabidopsis)]

  • Timing of WGD events relative to speciation events

    The three Musa WGDs occurred after the divergence of Zingiberales from Arecales and Poales

    [Figure: dated phylogeny (time axis 125 to 0; Jurassic, Cretaceous, Tertiary) of monocots (Oryza, Brachypodium, Sorghum in Poales; Zingiber, Musa in Zingiberales; Phoenix in Arecales; Asparagus; Acorus) and eudicots (Vitis, Medicago, Populus, Carica, Arabidopsis), annotated with WGD events (σ, ρ, γ, β, α) and a "?"]

    Phylogenomic analyses performed on 3,553 gene families

  • Over-retention of Musa transcription factors after WGD

    -> Amplification of several banana transcription factor gene families

    Gene loss WGD

    Collaboration Mathieu Rouard (Bioversity)

  • Distribution of gene “families” among banana, three Poaceae, date-palm and Arabidopsis

    Suggest a high level of divergence and diversification within the Poaceae lineage

    Enriched in genes coding for transcription factors, defense related proteins, cell wall metabolism and secondary metabolism enzymes

    Collaboration Mathieu Rouard (Bioversity)

  • Conclusion

    Crucial stepping-stone for genetic improvement of this under-researched vital crop

    An essential bridge for gene and genome evolution studies within Monocotyledons and with Dicotyledons

    The reference whole genome sequence of banana

    http://banana-genome.cirad.fr/

    http://southgreen.cirad.fr/

  • A new template for banana genetics

    ✔ Unlimited (almost) source of DNA markers located on the chromosomes
    - SSR: from several hundred to now several thousand located on the chromosomes
    - SNP: a template for SNP discovery and mapping for genetic and diversity studies

    ✔ A template to characterize chromosome structural variations (insertion, deletion, inversion, duplication, translocation) and subsequent prospects for germplasm improvement

    ✔ Access for the first time to the entire set of Musa genes (36 542)
    - Template for transcriptomic analysis
    - Candidate gene strategies based on physiological studies and insights from other species; freeway to crosscutting evidence

  • Nabila Yahiaoui Franc-Christophe Baurens Françoise Carreel Olivier Garsmeur Stéphanie Bocs Gaetan Droc Céline Cardi Marlène Souquet Cyril Jourda Juliette Lengelle Marguerite Rodier Didier Mbéguié Matthieu Chabannes Rémy Habas Ronan Rivallan Philippe Francois Claire Poiron Christophe Jenny Frédéric Bakry Steeve Joseph Anne Dievart Julie Leclercq Xavier Argout Ange-Marie Risterucci Manuel Ruiz Jean Christophe Glaszmann

    Patrick Wincker France Denoeud Jean-Marc Aury Benjamin Noel Corinne Da Silva Kamel Jabbari Julie Poulain Karine Labadie Adriana Alberti Maria Bernard Margot Correa Olivier Jaillon Jean Weissenbach

    Mathieu Rouard (Bioversity, France) Valentin Guignon (Gene families analysis)

    Thomas Wicker (Univ Zurich, Switzerland) Eva Hribova (IEB, Czech Republic) Jaroslav Dolezel (IEB, Czech Republic) Pat Heslop-Harrison (Univ Leicester, UK) Olivier Panaud (Univ Perpignan, France) José Barbosa (Univ Rio Grande, Brazil) Dheema Burthia (AREU, Mauritius) Mouna Jeridi (IRA, Tunisia) (Transposable element analysis +FISH)

    Andrzej Kilian (DArT, Canberra, Australia) (DArT development and genotyping)

    Nicolas Roux (Bioversity, France) Gert Kema (PRI, Wageningen) (provided MAMB BAC-end sequences)

    Dutch Groene Woudt Foundation

    Diane Burgess Mike Freeling (Univ Berkeley, USA) (CNS analysis)

    Michael McKain Saravanaraj Ayyampalayam Jim Leebens-Mack (Univ Georgia, USA) (Phylogenomic analysis)

    Eric Lyons (Univ Arizona, USA) (COGE tools)

    Francis Quetier

    Spencer Brown (ISV, Gif sur Yvette) (genome size)