Articles
https://doi.org/10.1038/s41588-018-0041-z
Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice
Qiang Zhao1, Qi Feng1, Hengyun Lu1, Yan Li1, Ahong Wang1, Qilin Tian1, Qilin Zhan1, Yiqi Lu1, Lei Zhang1, Tao Huang1, Yongchun Wang1, Danlin Fan1, Yan Zhao1, Ziqun Wang1, Congcong Zhou1, Jiaying Chen1, Chuanrang Zhu1, Wenjun Li1, Qijun Weng1, Qun Xu2, Zi-Xuan Wang1, Xinghua Wei2, Bin Han1 and Xuehui Huang1,3*
1National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China. 2State Key Laboratory of Rice Biology, China National Rice Research Institute, Chinese Academy of Agricultural Sciences, Hangzhou, China. 3College of Life and Environmental Sciences, Shanghai Normal University, Shanghai, China. Qiang Zhao and Qi Feng contributed equally to this work. *e-mail: [email protected]
SUPPLEMENTARY INFORMATION
In the format provided by the authors and unedited.
Nature Genetics | www.nature.com/naturegenetics
© 2018 Nature America Inc., part of Springer Nature. All rights reserved.
[Figure: neighbor-joining tree with clades Or-I, Or-II and Or-III. Labeled accessions: W3095-2, W1777, W1739, W0141, W2012, W1943, W1979, W1687, W3105-2, W0128, W3078-2, W0170, W1698, W1754, W0123-1.]
Supplementary Figure 1 Neighbor-joining tree of 446 accessions of wild rice O. rufipogon. The
phylogenetically representative accessions (colored in green) are selected based on the phylogenetic
relationships.
[Figure: neighbor-joining tree with groups Tropical japonica, Temperate japonica, Indica and Aus. Labeled accessions: GP3, GP772-1, HP327, HP362-2, GLA4, HP486, HP396, HP274, HP263, HP492, GP124, GP62, GP540, GP39, GP536, GP688, GP761-1, HP98, GP669, HP390, HP91-2, HP48, HP314, HP13-2, GP677, HP45, HP103, HP14, GP51, GP104, GP567, GP72, GP77, GP551, HP44, HP119, HP517-1, HP383, HP407, HP577, HP38, GP22.]
Supplementary Figure 2 Neighbor-joining tree of 950 accessions of cultivated rice O. sativa.
The phylogenetically representative accessions (colored in green) are selected based on the
phylogenetic relationships.
[Figure: flowchart in two panels.
Panel a labels: Rice paired-end reads; SOAPdenovo2 (pregraph, contig); Fermi; Contigs; Merge; Pre-scaffolds; SOAPdenovo2 (map, scaff); Scaffolds; GapCloser; Final assembly.
Panel b labels: Assembled genomes; RepeatMasker; Masked genomes; FgeneSH, InterPro; Gene models; Quality control; Blast, Clustal; GLA4 BAC sequences; Nipponbare reference sequences; MUMmer; Collinear sequence pairs; Diffseq, Clustal; Sequence polymorphisms (SNP, indel, NHS, large SV); Perl, C++; Merged tables of sequence polymorphisms; Orthologue genes; Genes absent in Nipponbare reference; RNAseq of W1943 & GLA4; Smalt; Blast.]
Supplementary Figure 3 The route map of de novo assembly and follow-up analysis for the 66
accessions. (a) The procedure for whole-genome de novo assembly. (b) The procedure for BAC-
based validation, genome annotation and detection of whole-genome variants.
[Figure: six panels, a to f. Track labels: Repeats; GLA4 assembly; Gla4_BAC_ctgs; Scaffold_9409; Gla4_BAC_ctg_06; G/C; Read depth. Annotations on the example alignments between Scaffold_9409 and Gla4_BAC_ctg_06: minor mis-assembly, false "T" insertion, low quality; minor mis-assembly, poly-"G", low quality; heterozygous, AT repeats, T/A = 19/37.]
Supplementary Figure 4 Quality evaluation of the GLA4 genome assembly using BAC-based sequences. (a) The physical
locations of 87 GLA4 contigs (merged from 273 BACs sequenced by the Sanger method) on chromosome 4 and the
corresponding scaffolds in the GLA4 assembly of the rice pan-genome data, colored in blue and green, respectively. Repeats in
red show the RepeatMasker-annotated transposable elements and other repeats. (b) The local genomic region of a 3-Mb interval
on chromosome 4. (c) Comparison between a scaffold in the GLA4 assembly and the corresponding BAC-based sequences.
The grey blocks show highly consistent regions between the Sanger-based BAC sequence and the scaffold. The inconsistent bases between
them are indicated. (d)-(f) Three examples of “low-quality” variants. The raw Illumina reads of GLA4 were aligned against
the GLA4 assembly itself. The “low-quality” variants were mostly due to assembly errors from multiple reads with “heterozygous
genotypes”, especially in the simple sequence repeat regions.
Supplementary Figure 5 The distribution of error rates in Nipponbare assembly across 12 rice chromosomes.
All the contigs of the Nipponbare assembly were aligned with the Nipponbare reference (version IRGSP4), and
the number of errors per base was calculated for each 100-kb window according to the sequence variants
between them.
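The windowed calculation described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the input format (1-based positions of mismatches between the assembly and the reference) and the choice of window length as the denominator are all assumptions made for the example.

```python
from collections import Counter

WINDOW = 100_000  # 100-kb windows, as stated in the caption


def window_error_rates(variant_positions, chrom_length, window=WINDOW):
    """Return the error rate (%) for each window along one chromosome.

    variant_positions: 1-based positions where the assembly differs from
    the reference (hypothetical input format).
    The denominator is the window length; using the number of aligned
    bases instead would be an equally plausible choice.
    """
    n_windows = (chrom_length + window - 1) // window
    counts = Counter((pos - 1) // window for pos in variant_positions)
    rates = []
    for i in range(n_windows):
        start = i * window
        size = min(window, chrom_length - start)  # last window may be shorter
        rates.append(100.0 * counts.get(i, 0) / size)
    return rates


def error_rich_windows(rates, cutoff=0.1):
    """Indices of windows whose error rate (%) exceeds the cutoff."""
    return [i for i, r in enumerate(rates) if r > cutoff]
```

The default cutoff of 0.1% mirrors the ">0.1% errors per base" figure quoted in the caption of Supplementary Figure 6.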
[Figure: histogram. Axis labels: Error rate (%); Total size (Mb), with ticks from 0 to 60. Error-rate bins run from 0~0.005 to 0.095~0.1 in steps of 0.005, followed by >=0.1.]
Supplementary Figure 6 Summary statistics of genomic regions with different
error rates in the Nipponbare assembly. A small fraction (12 Mb, ~3%) of the rice genome
was enriched with errors (>0.1% errors per base), while most of the other regions have
much fewer errors.
Supplementary Figure 7 The proportion of various kinds of genetic variants identified in this
study.
[Figure panels: Total; In gene coding region.]
Supplementary Figure 8 The proportion of sequence variants from the low-quality sites in 66 rice genomes.
Supplementary Figure 9 Whole-genome screening of domestication sweeps using the pan-genome-based variants. The
values of πw/πc are plotted against the position on each chromosome. The horizontal dashed line indicates the
genome-wide threshold of selection signals (πw/πc > 3). The six domestication sweeps identified using this
pan-genome data but missed in the previous results (low-coverage re-sequencing of 1,529 accessions) are indicated.
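The thresholding step in this caption can be illustrated with a short sketch. It assumes per-window nucleotide diversity values for wild rice (πw) and cultivated rice (πc) have already been computed; the function name, the list-based input and the handling of windows with πc = 0 are assumptions for the example, not details taken from the study.

```python
def candidate_sweep_windows(pi_wild, pi_cult, threshold=3.0):
    """Return indices of windows where pi_w / pi_c exceeds the threshold.

    pi_wild, pi_cult: per-window nucleotide diversity in wild and
    cultivated rice (hypothetical inputs of equal length).
    Windows with pi_c <= 0 are skipped because the ratio is undefined;
    this is an assumption of the sketch.
    """
    hits = []
    for i, (pw, pc) in enumerate(zip(pi_wild, pi_cult)):
        if pc <= 0:
            continue
        if pw / pc > threshold:
            hits.append(i)
    return hits
```

A high ratio means diversity in the cultivated group is strongly reduced relative to the wild group, which is the signal the dashed line at πw/πc > 3 is meant to capture.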
[Figure: y-axis, πw/πc.]
Supplementary Figure 10 Rooted neighbor-joining trees of 66 accessions using O. barthii W3106 as outgroup.
Bootstrap values, estimated with 100 replicates to assess the relative support for each branch, are
indicated on the trees. The three domestication sweeps on chromosome 3, chromosome 6 and chromosome 8,
which control panicle length, grain weight and stigma exsertion, respectively, were identified from the
genome-wide domestication selection scan. The aus accessions that were not included within the cultivated rice clade are
indicated by arrows.
[Figure: seven rooted neighbor-joining trees. Legend: Oryza sativa indica; Oryza sativa (Temperate japonica); Oryza sativa (Tropical japonica); Oryza sativa (Aromatic japonica); Oryza sativa aus; Oryza rufipogon. Panel titles: Bh4 +/- 10 kb; Prog1 +/- 10 kb; An1 +/- 10 kb; An2 +/- 10 kb; Chr3: 100k 26,900,001~27,000,000; Chr6: 100k 28,500,001~28,600,000; Chr8: 100k 23,900,001~24,000,000. Region coordinates printed with the gene panels: Chr07: 2,862,131~2,881,628; Chr04: 23,358,384~23,380,826; Chr04: 16,742,716~16,766,314; Chr04: 26,347,952~26,372,057. The leaves of each tree are the accessions plus W3106 (Outgroup), with bootstrap values on the branches.]
[Figure: one row per tropical japonica accession: GP39, GP77, GP536, GP640, GP761-1. Marked loci: OsTT1, OsSPL13.]
Supplementary Figure 11 Footprints of the introgression from indica to tropical japonica. For 807,139 SNP
sites with highly-differentiated alleles between indica and temperate japonica, the whole-genome SNP allele
information in each accession of tropical japonica are indicated along 12 rice chromosomes, with indica-
specific allele colored in red and temperate japonica-specific allele colored in blue.
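The coloring scheme in this caption amounts to a per-site lookup, sketched below. The function name and the string-based inputs are hypothetical; the sketch assumes that the indica-specific and temperate japonica-specific alleles at each highly differentiated site are already known.

```python
def paint_alleles(sample_alleles, indica_alleles, japonica_alleles):
    """Assign a color to each SNP site of one tropical japonica accession.

    All three arguments are equal-length sequences of single-base calls
    (hypothetical format). Returns "red" where the sample carries the
    indica-specific allele, "blue" where it carries the temperate
    japonica-specific allele, and None for missing or other calls.
    """
    colors = []
    for obs, ind, jap in zip(sample_alleles, indica_alleles, japonica_alleles):
        if obs == ind:
            colors.append("red")
        elif obs == jap:
            colors.append("blue")
        else:
            colors.append(None)
    return colors
```

Runs of "red" along a chromosome in an otherwise "blue" accession would correspond to the introgressed segments the figure is meant to show.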
Supplementary Figure 12 Naturally occurring variation in the gene sd-g that has been
identified previously through mutant mapping.
[Figure: gene model of Sd-g (Os05g0407500) with coordinate ticks at 19,875,000, 19,876,000, 19,877,000 and 19,878,000. Rows: Nipponbare, GP536, GP761-1, HP38, Kasalath. Legend: Exon; UTR; ATG start codon; Insertion; Deletion; Reference type; Alternative type.]
Supplementary Figure 13 Comparisons of the expression levels in four tissues of GLA4 between the genes in Nipponbare reference and
the “newly-identified” genes in GLA4 genome.
[Figure: four panels: GLA4 Leaf; GLA4 Seedling; GLA4 Root; GLA4 spike. x-axis: Log2RPKM (ticks 0 to 15); y-axis: Proportion (%) (ticks 0 to 20 or 0 to 25). Legend: Genes in Nipponbare reference; Genes in GLA4 assemblies but absent in Nipponbare reference.]
[Figure: line plot. Axis labels: Count Singletons, Clusters, and Total; Count Genes; Number of Input Genes. Horizontal ticks from 0 to 60; vertical ticks from 0 to 2,500,000 and from 0 to 45,000. Legend: Number of Singletons; Number of Clusters; Total of Singletons and Clusters.]
Supplementary Figure 14 Evaluation of the number of protein-coding genes in the rice pan-genome. Stepwise addition of rice
accessions from n = 2 to n = 67 was performed to evaluate the number of coding genes in the rice genome. In each independent
run from n = 2 to n = 67, clusters are defined as coding genes present in at least two rice accessions, and singletons are defined
as coding genes present in only one of the rice accessions.
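The cluster and singleton definitions in this caption can be made concrete with a small sketch. It assumes that genes have already been grouped into shared identifiers across accessions (for example by an orthologue clustering step); the function name, the set-based input and the single fixed order of addition are assumptions of the example rather than the procedure used in the study.

```python
from collections import Counter


def pan_genome_curve(accession_genes):
    """Count clusters and singletons as accessions are added one by one.

    accession_genes: list of sets of gene identifiers, one set per
    accession, in the order of addition (hypothetical input).
    Returns a list of (n, clusters, singletons, total) for n >= 2, where
    a cluster is a gene present in at least two of the first n accessions
    and a singleton is a gene present in exactly one of them.
    """
    presence = Counter()
    curve = []
    for n, genes in enumerate(accession_genes, start=1):
        presence.update(set(genes))  # each accession counts once per gene
        if n < 2:
            continue
        clusters = sum(1 for c in presence.values() if c >= 2)
        singletons = sum(1 for c in presence.values() if c == 1)
        curve.append((n, clusters, singletons, clusters + singletons))
    return curve
```

For instance, with three toy accessions `[{"g1", "g2"}, {"g2", "g3"}, {"g1", "g4"}]` the function returns `[(2, 1, 2, 3), (3, 2, 2, 4)]`. Repeating the run over shuffled orders of addition would give one curve per independent run.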
Supplementary Figure 15 The number of coding genes mainly present in one group but absent
in all other groups.
Supplementary Figure 16 The proportion of annotated genes with different protein domains in InterPro in the core set and the pan set.