ARTICLES
https://doi.org/10.1038/s41588-018-0041-z

Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice

Qiang Zhao 1, Qi Feng 1, Hengyun Lu 1, Yan Li 1, Ahong Wang 1, Qilin Tian 1, Qilin Zhan 1, Yiqi Lu 1, Lei Zhang 1, Tao Huang 1, Yongchun Wang 1, Danlin Fan 1, Yan Zhao 1, Ziqun Wang 1, Congcong Zhou 1, Jiaying Chen 1, Chuanrang Zhu 1, Wenjun Li 1, Qijun Weng 1, Qun Xu 2, Zi-Xuan Wang 1, Xinghua Wei 2, Bin Han 1 and Xuehui Huang 1,3 *

1 National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China.
2 State Key Laboratory of Rice Biology, China National Rice Research Institute, Chinese Academy of Agricultural Sciences, Hangzhou, China.
3 College of Life and Environmental Sciences, Shanghai Normal University, Shanghai, China.
Qiang Zhao and Qi Feng contributed equally to this work. *e-mail: [email protected]

SUPPLEMENTARY INFORMATION
In the format provided by the authors and unedited.

NATURE GENETICS | www.nature.com/naturegenetics
© 2018 Nature America Inc., part of Springer Nature. All rights reserved.



Supplementary Figure 1 Neighbor-joining tree of 446 accessions of wild rice O. rufipogon. The phylogenetically representative accessions (colored in green) are selected based on the phylogenetic relationships.

[Figure: neighbor-joining tree. Clade labels: Or-I, Or-II, Or-III. Labeled accessions: W3095-2, W1777, W1739, W0141, W2012, W1943, W1979, W1687, W3105-2, W0128, W3078-2, W0170, W1698, W1754, W0123-1.]

Supplementary Figure 2 Neighbor-joining tree of 950 accessions of cultivated rice O. sativa. The phylogenetically representative accessions (colored in green) are selected based on the phylogenetic relationships.

[Figure: neighbor-joining tree. Group labels: Tropical japonica, Temperate japonica, Indica, Aus. Labeled accessions: GP3, GP772-1, HP327, HP362-2, GLA4, HP486, HP396, HP274, HP263, HP492, GP124, GP62, GP540, GP39, GP536, GP688, GP761-1, HP98, GP669, HP390, HP91-2, HP48, HP314, HP13-2, GP677, HP45, HP103, HP14, GP51, GP104, GP567, GP72, GP77, GP551, HP44, HP119, HP517-1, HP383, HP407, HP577, HP38, GP22.]

Supplementary Figure 3 The route map of de novo assembly and follow-up analysis for the 66 accessions. (a) The procedure for whole-genome de novo assembly. (b) The procedure for BAC-based validation, genome annotation and detection of whole-genome variants.

[Figure: flowchart in two panels; arrows are not recoverable from the extracted text.
Panel a labels: Rice paired-end reads; SOAPdenovo2 (pregraph, contig); Fermi; Contigs; Merge; Pre-scaffolds; SOAPdenovo2 (map, scaff); Scaffolds; GapCloser; Final assembly.
Panel b labels: Assembled genomes; RepeatMasker; Masked genomes; FgeneSH, InterPro; Gene models; Quality control (Blast, Clustal; GLA4 BAC sequences); Nipponbare reference sequences; MUMmer; Collinear sequence pairs; Diffseq, Clustal; Sequence polymorphisms (SNP, indel, NHS, large SV); Perl, C++; Merged tables of sequence polymorphisms; Orthologue genes; Genes absent in Nipponbare reference; RNAseq of W1943 & GLA4; Smalt; Blast.]

Supplementary Figure 4 Quality evaluation of the GLA4 genome assembly using BAC-based sequences. (a) The physical locations of 87 GLA4 contigs (merged from 273 BACs sequenced by the Sanger method) on chromosome 4 and the corresponding scaffolds in the GLA4 assembly of the rice pan-genome data, colored in blue and green, respectively. Repeats in red show the RepeatMasker-annotated transposable elements and other repeats. (b) The local genomic region of a 3-Mb interval on chromosome 4. (c) Comparison between a scaffold in the GLA4 assembly and the corresponding BAC-based sequences. The grey blocks show highly consistent regions between the Sanger-based BAC sequences and the scaffold. The inconsistent bases between them are indicated. (d)-(f) Three examples of “low-quality” variants. The raw Illumina reads of GLA4 were aligned against its own assembly. The “low-quality” variants were mostly due to assembly errors from multiple reads with “heterozygous genotypes”, especially in simple sequence repeat regions.

[Figure: panels a to f. Track labels: Repeats; GLA4 assembly; Gla4_BAC_ctgs; G/C; Read depth. Sequences compared: Scaffold_9409 and Gla4_BAC_ctg_06. Annotations on the three alignment examples: "Heterozygous, AT repeats, T/A = 19/37"; "Minor misassembly: false 'T' insertion, low quality"; "Minor misassembly: poly-'G', low quality".]

Supplementary Figure 5 The distribution of error rates in the Nipponbare assembly across 12 rice chromosomes. All the contigs of the Nipponbare assembly were aligned with the Nipponbare reference (version IRGSP4), and the number of errors per base was calculated for each 100-kb window according to the sequence variants between them.

[Figure: y-axis: Error rate (%).]
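The per-window calculation described in the caption of Supplementary Figure 5 can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `window_error_rates`, the input format (a plain list of 1-based reference positions at which the assembly differs from the reference) and the toy numbers are all assumptions made for the example.

```python
from collections import Counter

WINDOW = 100_000  # 100-kb windows, as stated in the caption


def window_error_rates(error_positions, chrom_length, window=WINDOW):
    """Return the error rate (%) of each window along one chromosome.

    error_positions: 1-based reference coordinates of mismatching bases
    (hypothetical input; the source does not specify a file format).
    """
    # Count errors per window index (window 0 covers positions 1..window).
    counts = Counter((pos - 1) // window for pos in error_positions)
    n_windows = (chrom_length + window - 1) // window
    rates = []
    for i in range(n_windows):
        start = i * window
        end = min(start + window, chrom_length)  # last window may be shorter
        rates.append(100.0 * counts.get(i, 0) / (end - start))
    return rates


# Toy example: a 250-kb chromosome with five mismatching bases.
rates = window_error_rates([10, 500, 99_999, 100_000, 150_001], 250_000)
print(rates)
```

The last window is normalised by its actual length, so a short terminal window is not reported with an artificially low rate.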

Supplementary Figure 6 Summary statistics of genomic regions with different error rates in the Nipponbare assembly. A small fraction (12 Mb, ~3%) of the rice genome was enriched with errors (>0.1% errors per base), while most of the other regions have far fewer errors.

[Figure: bar chart. x-axis: Error rate (%), in bins of 0.005 from 0~0.005 to 0.095~0.1, plus >=0.1. y-axis: Total size (Mb), ticks 0 to 60.]

Supplementary Figure 7 The proportion of various kinds of genetic variants identified in this study.

[Figure: panel titles: Total; In gene coding region.]


Supplementary Figure 8 The proportion of sequence variants from the low quality sites in 66 rice genomes.

Supplementary Figure 9 Whole-genome screening of domestication sweeps using the pan-genome-based variants. The values of πw/πc are plotted against the position on each chromosome. The horizontal dashed line indicates the genome-wide threshold of selection signals (πw/πc > 3). The six domestication sweeps identified using the pan-genome data but missed in the previous results (low-coverage re-sequencing of 1,529 accessions) are indicated.

[Figure: y-axis: πw/πc.]
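The threshold rule in the caption of Supplementary Figure 9 (πw/πc > 3) can be illustrated with a small sketch. It assumes biallelic sites, uses the standard pairwise-difference estimator of nucleotide diversity, and takes hypothetical per-window allele counts as input. The function names, the sample sizes and the decision to flag windows with zero diversity in cultivated rice are assumptions and are not taken from the source.

```python
def site_pi(alt_count, n):
    """Mean pairwise difference at one biallelic site among n sequences."""
    return 2.0 * alt_count * (n - alt_count) / (n * (n - 1))


def window_pi(alt_counts, n, window_len):
    """Nucleotide diversity per base for one window."""
    return sum(site_pi(k, n) for k in alt_counts) / window_len


def sweep_windows(wild, cult, n_wild, n_cult, window_len, threshold=3.0):
    """Return indices of windows whose pi_w / pi_c exceeds the threshold.

    wild, cult: one list per window holding the alternative-allele count
    at each variable site (hypothetical input format).
    """
    hits = []
    for i, (w_counts, c_counts) in enumerate(zip(wild, cult)):
        pi_w = window_pi(w_counts, n_wild, window_len)
        pi_c = window_pi(c_counts, n_cult, window_len)
        if pi_c > 0:
            ratio = pi_w / pi_c
        else:
            # Assumption: no diversity in cultivated rice counts as an
            # unbounded ratio when wild rice is still variable.
            ratio = float("inf") if pi_w > 0 else 0.0
        if ratio > threshold:
            hits.append(i)
    return hits


# Toy data: three 1-kb windows, 10 wild and 10 cultivated sequences.
wild = [[5, 5, 5], [5, 5, 5, 5], [3]]
cult = [[5, 5, 5], [1], []]
hits = sweep_windows(wild, cult, n_wild=10, n_cult=10, window_len=1000)
print(hits)
```

In the toy data the first window has equal diversity in both groups, while the second and third have lost most or all diversity in the cultivated group and are reported.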

Supplementary Figure 10 Rooted neighbor-joining trees of 66 accessions using O. barthii W3106 as outgroup. Bootstrap values, which are estimated with 100 replicates to assess the relative support for each branch, are indicated on the trees. The three domestication sweeps on chromosome 3, chromosome 6 and chromosome 8 were identified from the genome-wide domestication selection scan, controlling panicle length, grain weight and stigma exsertion, respectively. The aus accessions that were not included within the cultivated rice clade are indicated by arrows.

[Figure: seven trees. Panel titles: Bh4 +/- 10 kb; Prog1 +/- 10 kb; An1 +/- 10 kb; An2 +/- 10 kb; Chr3: 100k 26,900,001~27,000,000; Chr6: 100k 28,500,001~28,600,000; Chr8: 100k 23,900,001~24,000,000. Intervals printed beside the gene panels: Chr07: 2,862,131~2,881,628; Chr04: 23,358,384~23,380,826; Chr04: 16,742,716~16,766,314; Chr04: 26,347,952~26,372,057. Legend: Oryza sativa indica; Oryza sativa (Temperate japonica); Oryza sativa (Tropical japonica); Oryza sativa (Aromatic japonica); Oryza sativa aus; Oryza rufipogon. Tip labels are accession names, with W3106 marked as outgroup; bootstrap values are printed on the branches.]

Supplementary Figure 11 Footprints of the introgression from indica to tropical japonica. For 807,139 SNP sites with highly differentiated alleles between indica and temperate japonica, the whole-genome SNP allele information in each accession of tropical japonica is indicated along the 12 rice chromosomes, with the indica-specific allele colored in red and the temperate japonica-specific allele colored in blue.

[Figure: accession labels: GP39, GP77, GP536, GP640, GP761-1. Gene labels: OsTT1, OsSPL13.]
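The coloring scheme described for Supplementary Figure 11 can be sketched in two steps: pick sites whose alleles are highly differentiated between indica and temperate japonica, then label each such site in a tropical japonica accession by the group its allele matches. The frequency cutoff `min_diff=0.9`, the 0/1 genotype coding and the function names are hypothetical; the source does not state the criterion used to define "highly differentiated".

```python
def diagnostic_sites(ind_freq, tej_freq, min_diff=0.9):
    """Select sites whose alternative-allele frequency differs strongly.

    ind_freq, tej_freq: alternative-allele frequency per site in indica and
    temperate japonica. min_diff is an assumed cutoff, not from the source.
    Returns {site_index: allele typical of indica (1 = alt, 0 = ref)}.
    """
    sites = {}
    for i, (fi, ft) in enumerate(zip(ind_freq, tej_freq)):
        if abs(fi - ft) >= min_diff:
            sites[i] = 1 if fi > ft else 0
    return sites


def paint(genotype, sites):
    """Label each diagnostic site of one accession as 'red' or 'blue'.

    'red' marks the indica-specific allele and 'blue' the temperate
    japonica-specific allele, following the caption's color scheme.
    """
    return {i: ("red" if genotype[i] == ind_allele else "blue")
            for i, ind_allele in sites.items()}


# Toy data: four SNP sites and one tropical japonica accession.
ind_freq = [0.95, 0.50, 0.02, 1.00]
tej_freq = [0.01, 0.50, 0.98, 0.05]
sites = diagnostic_sites(ind_freq, tej_freq)
painted = paint([1, 0, 1, 0], sites)
print(painted)
```

Runs of 'red' sites along a chromosome in such output would correspond to the candidate introgressed segments shown in the figure; missing genotypes are not handled in this sketch.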

Supplementary Figure 12 Naturally occurring variation in the gene sd-g that has been identified previously through mutant mapping.

[Figure: gene model of Sd-g (Os05g0407500) with coordinate ticks at 19,875,000, 19,876,000, 19,877,000 and 19,878,000. Rows: Nipponbare, GP536, GP761-1, HP38, Kasalath. Legend: Exon; UTR; ATG (start codon); Insertion; Deletion; Reference type; Alternative type.]

Supplementary Figure 13 Comparisons of the expression levels in four tissues of GLA4 between the genes in the Nipponbare reference and the “newly-identified” genes in the GLA4 genome.

[Figure: four panels (GLA4 Leaf, GLA4 Seedling, GLA4 Root, GLA4 spike). x-axis: Log2RPKM; y-axis: Proportion (%). Legend: Genes in Nipponbare reference; Genes in GLA4 assemblies but absent in Nipponbare reference.]

Supplementary Figure 14 Evaluation of the number of protein-coding genes in the rice pan-genome. Stepwise addition of rice accessions from n = 2 to n = 67 was performed to evaluate the number of coding genes in the rice genome. In each independent run from n = 2 to n = 67, clusters are defined as coding genes present in at least two rice accessions, and singletons are defined as coding genes present in only one of the rice accessions.

[Figure: y-axes: Count Singletons, Clusters, and Total; Count Genes. x-axis ticks: 0 to 60. Legend: Number of Singletons; Number of Clusters; Total of Singletons and Clusters; Number of Input Genes.]
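The cluster and singleton definitions in the caption of Supplementary Figure 14 can be turned into a short accumulation sketch. It assumes that genes have already been assigned to shared family identifiers across accessions (the source does not describe that step here), and the function name, the input format and the fixed order of addition are illustrative choices only.

```python
from collections import Counter


def accumulation_curve(gene_sets):
    """Count clusters and singletons as accessions are added one by one.

    gene_sets: list of sets of gene-family identifiers, one set per
    accession (hypothetical input; family assignment is assumed done).
    Returns a list of (n, clusters, singletons, total) for n = 2..N.
    """
    counts = Counter()  # family -> number of accessions that carry it
    curve = []
    for n, genes in enumerate(gene_sets, start=1):
        counts.update(genes)  # a set contributes each family at most once
        if n < 2:
            continue
        clusters = sum(1 for c in counts.values() if c >= 2)
        singletons = sum(1 for c in counts.values() if c == 1)
        curve.append((n, clusters, singletons, clusters + singletons))
    return curve


# Toy data: three accessions with partly shared gene families.
curve = accumulation_curve([
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g5"},
])
print(curve)
```

Because the counts depend on the order in which accessions are added, a fuller analysis would typically repeat the run over several orderings; this sketch uses a single order.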


Supplementary Figure 15 The number of coding genes mainly present in one group but absent

in all other groups.


Supplementary Figure 16 The proportion of annotated genes with different protein domains in InterPro in the core set and the pan set.