42
1 Classification: Biological Sciences – Plant Biology Seed Genome Hypomethylated Regions Are Enriched In Transcription Factor Genes Min Chen a , Jer-Young Lin a , Jungim Hur a , Julie M. Pelletier b , Russell Baden b1 , Matteo Pellegrini a , John J. Harada b , and Robert B. Goldberg a,2 a Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, CA 90095; b Department of Plant Biology, College of Biological Sciences, University of California, Davis, CA 95616, USA. Author contributions: M.C., J.-Y.L., M.P., J.J.H., and R.B.G. designed research; M.C., J.-Y.L., J.H., J.M.P., and R.B. performed research; M.C., J.-Y.L., J.M.P., and R.B. analyzed data; R.B.G. wrote the paper; J.J.H. and M.P. provided comments on the manuscript. Running Title: Seed DNA Methylation Valleys Footnotes: 1 Present address: California Animal Health and Food Safety Laboratory, University of California, Davis, CA 95616. 2 To whom correspondence should be addressed: E-mail: [email protected]. Corresponding Author: Robert B. Goldberg, Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095-7239; 310-825-9093; [email protected] Key Words seed development | DNA methylation valleys | transcription factor genes | soybean | Arabidopsis . CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted June 27, 2018. . https://doi.org/10.1101/356501 doi: bioRxiv preprint

Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

1

Classification: Biological Sciences – Plant Biology

Seed Genome Hypomethylated Regions Are Enriched In Transcription Factor Genes

Min Chena, Jer-Young Lina, Jungim Hura, Julie M. Pelletierb, Russell Badenb1, Matteo

Pellegrinia, John J. Haradab, and Robert B. Goldberga,2

aDepartment of Molecular, Cell, and Developmental Biology, University of California, Los

Angeles, CA 90095; bDepartment of Plant Biology, College of Biological Sciences, University

of California, Davis, CA 95616, USA.

Author contributions: M.C., J.-Y.L., M.P., J.J.H., and R.B.G. designed research; M.C., J.-Y.L.,

J.H., J.M.P., and R.B. performed research; M.C., J.-Y.L., J.M.P., and R.B. analyzed data; R.B.G.

wrote the paper; J.J.H. and M.P. provided comments on the manuscript.

Running Title: Seed DNA Methylation Valleys

Footnotes:

1Present address: California Animal Health and Food Safety Laboratory, University of

California, Davis, CA 95616. 2To whom correspondence should be addressed: E-mail: [email protected].

Corresponding Author: Robert B. Goldberg, Department of Molecular, Cell, and

Developmental Biology, University of California, Los Angeles, 610 Charles E. Young Drive

East, Los Angeles, CA 90095-7239; 310-825-9093; [email protected]

Key Words

seed development | DNA methylation valleys | transcription factor genes |

soybean | Arabidopsis

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 2: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

2

Significance We scanned soybean and Arabidopsis seed genomes for hypomethylated regions, or DNA

Methylation Valleys (DMVs), present in mammalian cells. A significant fraction of seed

genomes contain DMV regions that have < 5% bulk DNA methylation, or, in many cases, no

detectable DNA methylation. Methylation levels of seed DMVs do not vary detectably during

seed development with respect to time, region, and tissue, and are present prior to fertilization.

Seed DMVs are enriched in transcription factor genes and other genes critical for seed

development, and are also decorated with histone marks that fluctuate with developmental stage,

resembling in significant ways their animal counterparts. We conclude that many genes playing

important roles in seed formation are regulated in the absence of detectable DNA methylation

events, and suggest that selective action of transcriptional activators and repressors, as well as

chromatin epigenetic events play important roles in making a seed – particularly embryo

formation.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 3: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

3

Abstract

The precise mechanisms that control gene activity during seed development remain largely

unknown. Previously, we showed that several genes essential for seed development, including

those encoding storage proteins, fatty acid biosynthesis enzymes, and transcriptional regulators,

such as ABI3 and FUS3, are located within hypomethylated regions of the soybean genome.

These hypomethylated regions are similar to the DNA methylation valleys (DMVs), or canyons,

found in mammalian cells. Here, we address the question of the extent to which DMVs are

present within seed genomes, and what role they might play in seed development. We scanned

soybean and Arabidopsis seed genomes from post-fertilization through dormancy and

germination for regions that contain < 5% or < 0.4% bulk methylation in CG-, CHG-, and CHH-

contexts over all developmental stages. We found that DMVs represent extensive portions of

seed genomes, range in size from 5 to 76 kb, are scattered throughout all chromosomes, and are

hypomethylated throughout the plant life cycle. Significantly, DMVs are enriched greatly in

transcription factor genes, and other developmental genes, that play critical roles in seed

formation. Many DMV genes are regulated with respect to seed stage, region, and tissue – and

contain H3K4me3, H3K27me3, or bivalent marks that fluctuate during development. Our results

indicate that DMVs are a unique regulatory feature of both plant and animal genomes, and that a

large number of seed genes are regulated in the absence of methylation changes during

development – probably by the action of specific transcription factors and epigenetic events at

the chromatin level.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 4: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

4

Introduction

Seeds are the agents for higher plant sexual reproduction, and an essential source of food for

human and animal consumption. They consist of three major regions – the seed coat,

endosperm, and embryo – which have different genetic origins, unique functions, and distinct

developmental pathways (1). The seed coat transfers nutrients from the maternal plant to the

embryo during seed development, and protects the seed during dormancy – a period of

quiescence where growth and development have ceased temporarily. The endosperm also

provides nourishment to the embryo, particularly during early embryogenesis, and in dicots,

degenerates and remains as a vestigial cell layer in the mature seed (2). The embryo, on the

other hand, differentiates into axis and cotyledon regions – the former giving rise to the mature

plant after germination, while the latter is terminally differentiated and accumulates storage

reserves that are used as an energy source for the germinating seedling (1, 3). Gene activity is

highly regulated with respect to space and time within each seed region (4). However, the

detailed mechanisms required for the differentiation of each seed region remain to be identified.

Selective DNA methylation of maternal and paternal alleles, or imprinting, plays a

critical role in endosperm development (5-7). Imprinting, however, does not appear to play a

major role in embryo formation (6, 7). The extent to which DNA methylation events regulate the

activity of specific genes during embryogenesis is largely unknown. Recently, we (8), and

others (9-11), showed that, on a global basis, CG- and CHG-context methylation does not change

significantly during seed development. By contrast, CHH-context methylation increases

primarily within transposons during the period leading up to dormancy (8-11), and appears to be

a fail-safe mechanism to ensure that transposons remain silent and do not inactivate genes

essential for seed development and germination (8).

In the course of our seed methylome studies with both soybean and Arabidopsis, we

identified several genes important for seed formation, including storage protein and fatty acid

metabolism genes, which are located within hypomethylated genomic regions that remain static

with respect to DNA methylation throughout development (8). These regions resemble the DNA

methylation valleys (DMVs) (12), hypomethylated canyons (UMRs) (13), and non-methylated

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 5: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

5

islands (NMIs) (14) found in many animal cell types which are enriched with transcription factor

genes and coated with specific histone marks. Genes within these DMVs are not regulated by

DNA methylation events, but by both transcriptional and epigenetic processes at the chromatin

level (12-15).

In this study, we scanned soybean and Arabidopsis seed genomes for DMVs. We found

that a significant fraction of these seed genomes contain DMV regions that have < 5% bulk DNA

methylation, or, in many cases, no detectable DNA methylation. Methylation levels of seed

DMVs do not vary detectably during seed development with respect to time, region, and tissue,

and appear to present prior to fertilization. Seed DMVs are enriched in transcription factor genes

and other genes critical for seed development, and are also decorated with histone marks that

fluctuate with developmental stage, resembling in significant ways their animal counterparts.

We conclude that many genes playing important roles in seed formation are regulated in the

absence of detectable DNA methylation events, and suggest that selective action of

transcriptional activators and repressors, as well as chromatin epigenetic events play important

roles in making a seed – particularly embryo formation.

Results

DMVs Represent a Significant Portion of Soybean Seed Genomes. We scanned soybean seed

methylomes from the globular stage through dormancy and germination for regions with < 5%

bulk methylation in all cytosine contexts (CG, CHG, and CHH) using a sliding 5 kb window

with 1 kb smaller steps to search for hypomethylated regions, or DMVs (Figs. 1 and 2A). Seed

DMVs were defined as genomic regions with < 5% bulk methylation over all developmental

stages investigated (see Materials and Methods). This strategy identified 21,669 seed DNA

regions, or 21% of the soybean genome (210 Mb), that were hypomethylated and did not vary

significantly throughout seed development, or during early seed germination, with respect to

methylation status (Fig. 2B, Table 1, and Dataset S1). Ninety-nine percent of seed DMVs

identified during seed development was also shared with seedling and seedling cotyledon DMV

regions, and, as such, we refer to all DMVs uncovered by our genome scans as seed DMVs (see

Materials and Methods). Detailed analysis showed that (i) the majority of seed DMVs had bulk

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 6: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

6

methylation levels between 0 to 1% in all cytosine contexts (Appendix Fig. S1A), and (ii) the

average bulk methylation level per DMV was 0.14% (Appendix Fig. S1B), or, on average, one

methylated cytosine per 1 kb of DMV (Appendix Fig. S1B), indicating that our strategy was

robust enough to identify genomic regions that were significantly hypomethylated over all

developmental stages. By contrast, the average bulk methylation level of the soybean genome

was 11.5%, or, on average, 43 methylated cytosines per 1 kb (Appendix Fig. S1B). We validated

this approach by scanning soybean seed methylomes at a < 0.4% bulk methylation criterion –

which was the lowest level of bulk methylation identifiable using bisulfite (BS)-Seq (8) (see

Materials and Methods) – and uncovered 14,558 seed DMVs, or 112 Mb of the seed genomic

regions, which effectively had no detectable cytosine methylation over all of the developmental

stages examined -- (Table 1, Appendix Fig. S1B, and Dataset S1).

Soybean seed DMVs averaged 10 kb in length, extended up to 76 kb, and 50% of the

DMVs ranged in size from 6 to 11 kb at the < 5% scanning criterion (Table 1 and Fig 2C).

These values were somewhat smaller using the more stringent < 0.4% scanning criterion (Table

1). Soybean seed DMVs were located primarily on the arms (Fig. 2D) of all chromosomes

(Appendix Fig. S2), although over 2,000 DMVs were embedded within highly methylated

pericentromeric regions (Fig. 2D). One example of a 28 kb seed DMV that was located on the

distal arm of chromosome 3 is shown in Fig. 2E. This DMV region had four genes, an average

bulk cytosine methylation level of 0.04%, and was static with respect to methylation level during

seed development – from the globular stage through dormancy and early germination (Fig. 2A

and E). Taken together, these results indicate that DMVs represent a significant portion of the

soybean seed genome, and do not vary significantly with respect to methylation status during

seed development and early germination.

Soybean DMVs Are Present in Different Seed Regions and Tissues. Soybean DMVs were

identified using methylome data from whole seeds (see Materials and Methods and Fig. 2A) (8).

To determine whether the seed DMVs were present within different seed regions and tissues, we

scanned soybean seed methylomes from (i) embryonic axis and cotyledon regions at three

developmental stages, and (ii) specific seed coat and cotyledon tissues at early maturation for

DMVs at the < 5% methylation criterion (Dataset S1), and compared these DMVs with those

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 7: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

7

identified in seeds as a whole (Fig. 3 A-C). We generated these methylomes previously from

hand-dissected seed regions (seed coat, axis, and cotyledons), and from specific seed tissues that

were captured using laser capture microdissection (LCM) (8).

The vast majority of whole seed DMVs overlapped with those identified within different

seed regions and tissues (Fig. 3D). For example, 99% of the whole seed DMVs were found

within the seed coat and axis, while 98% were present with the cotyledons. Similarly, 99% of

the whole seed DMVs were represented in specific seed coat tissue layers (parenchyma and

palisade) and within adaxial and abaxial cotyledon parenchyma tissues (Fig. 3D). These results

indicate that soybean seed DMVs are not confined to one seed compartment, but are present

throughout the seed within diverse regions and tissues that have unique functions and

developmental fates, and are a unique feature of seed and seedling genomes.

Soybean Seed DMVs Are Enriched in Transcription Factor Genes. We searched seed

DMVs for genes, and scored identified genes as DMV genes if they had their gene bodies and 1

kb of 5’ and 3’ flanking regions within DMV regions (see Material and Methods, Fig. 2E). We

uncovered 14,328 and 6,377 diverse genes within DMVs at the < 5% and < 0.4% scanning

criteria, respectively (Table 1 and Dataset S2). The former represented 26% of all genes within

the soybean genome (Fig. 4A).

Remarkably, 46% of soybean transcription factor (TF) genes (1,721) were located within

seed DMV regions (Fig. 4A and Table 1) – a significant enrichment of TF genes within the

soybean genome (t-test, P < 0.05). Gene ontology (GO) analysis indicated that the most enriched

seed DMV gene GO functional group was regulation of transcription [False Discovery Rate

(FDR) 4.9 x 10-80], containing 1,413 TF genes, and was consistent with the large number of TF

genes represented within seed DMVs (Fig. 4A, Table 1, and Dataset S3). A control group of

14,328 randomly selected soybean genes had no GO enrichment terms. DMV genes were

distributed into many developmentally and physiologically relevant GO functional classes, such

as developmental processes, response to hormones, response to stimulus, and transport, among

others (Fig. 4B). Seed relevant developmental categories included pattern specification,

cotyledon vascular tissue formation, radial pattern formation, organ boundary formation, and

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 8: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

8

multicellular development (Fig. 4B). Each of these functional categories contained large

numbers of specific TF genes that most likely guide and control these processes (Fig. 4C).

Together, these data show that TF genes that play major roles in soybean seed formation are

preferentially located with DMV regions.

Many Soybean DMV Genes Are Regulated During Seed Development. We analyzed

soybean whole seed RNA-Seq data, and RNA-Seq data from specific seed regions, subregions,

and tissues captured using LCM, for DMV genes that were regulated during seed development

(see Materials and Methods). We found a large number of DMV genes that were up-regulated >

5-fold in specific seed developmental stages, including many TF genes (Fig. 5A and B, Dataset

S2). For example, GmWOX9a, GmSPCH, and GmTOM3 TF genes were expressed specifically

at the globular, early-maturation, and late-maturation stages of seed development, respectively,

and were present in DMV regions that did not vary significantly with respect to methylation

during seed formation (Fig. 5C). All of the DMV stage-specific TF genes (Fig 5B) were

represented within the regulation of transcription and developmentally relevant functional GO

groups (Fig. 4B and C).

A large number of DMV genes, including those encoding TFs, were regulated with

respect to specific seed regions, sub-regions, and tissues at different developmental stages as

well (Appendix Fig. S3 A-C and Dataset S2). Approximately, 22% of all DMV genes (3,189)

were expressed preferentially within specific seed parts using a > five-fold up-regulation

criterion (Appendix Fig S3A) – representing almost half of all soybean seed-region-, subregion-,

and tissue-specific genes (Appendix Fig. S3B). For example, a TRIHELIX DNA BINDING

PROTEIN gene located within a 10 kb seed DMV region that was shared by all seed

developmental stages, regions, and tissues investigated (8), was expressed specifically within

early-maturation stage seed coat palisade tissue layer (Appendix Fig. S3D and E). This DMV

had a maximum of three methylated cytosines over its entire 10 kb length, did not change its

methylation status during development and, was flanked on either side by cliffs of highly

methylated transposable elements (Appendix Fig. S3E). Taken together, these data show that a

large number of soybean genes that reside within seed DMV regions are regulated with respect

to space and time and participate in important seed developmental processes (Fig. 4).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 9: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

9

DMV Genes Are Enriched in H27Kme3 and Bivalent Histone Marks. We investigated

soybean embryos at different developmental stages to determine whether DMV genes were

coated with specific histone marks (see Material and Methods). We used embryos instead of

whole seeds to minimize the representation of different cell types in our chromatin preparations.

Approximately 90% of all cells were derived from cotyledon parenchyma tissue at all stages of

embryo development investigated for histone marks (16).

DMV genes were enriched significantly (hypergeometric test, P < 0.001) in H3K27me3

and bivalent (H3K27me3 and H3K4me3) marks at the cotyledon, early-maturation, mid-

maturation, and late-maturation stages, as well as in seedlings, compared with all soybean genes

(Fig. 6A) – a distinctive feature of animal cell DMVs (12, 13). By contrast, there was no

significant enrichment of H3K4me3 at any developmental stage, although a large fraction of

expressed DMV genes were coated with this histone mark (Fig. 6A). DMV genes marked with

H3K4me3 had the highest expression levels, followed by bivalent and H3K27me3 marked genes,

respectively (Appendix Fig. S4). H3K27me3 marked DMV genes were repressed, or expressed

at very low levels, consistent with the repressive nature of the H3K27me3 mark (13).

Surprisingly, DMV genes coated with H3K27me3 and bivalent marks were enriched

significantly for the transcriptional regulation functional GO term group, a property also similar

to their animal DMV gene counterparts (12, 13). For example, at mid-maturation H3K27me3

and bivalent marked DMV genes had FDRs of 2.93 x 10-24 and 3.92 x 10-108 for this GO term

group, respectively. Supporting these results, 62 to 74% of the 1,721 DMV TF genes (Table 1)

contained bivalent or H3K27me3 marks depending upon the developmental stage – a significant

enrichment compared with all TF genes (hypergeometric test, P < 0.001) (Fig. 6B). By contrast,

DMV genes marked with H3K4me3 did not generate a transcriptional regulation enrichment

group by GO analysis.

The chromatin marks for many seed DMV genes changed during development in parallel

with their expression levels at specific stages (Fig. 6A and Appendix Fig. S4). For example, the

P34 ALLERGEN PROTEIN gene had an H3K27me3 repressive mark at the cotyledon stage

where it was repressed, an H3K4me3 active mark during early and mid maturation where it was

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 10: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

10

active at a high level, and an H3K27me3 mark in late maturation and post-germination seedlings

where it was silent (Fig. 6C). The ABSCISIC INHIBITOR3-1 (ABI3-1) TF gene had an active

H3K4me3 mark during seed development when it was expressed, but an H3K27me3 mark

following germination when it was repressed (Fig. 6C). Finally, the AINTEGUMENTA-LIKE5-1

(AIL5-1) TF gene had bivalent marks at cotyledon and early-maturation stages where it was

silent or expressed at low levels, an H3K4me3 mark during mid-maturation and late maturation

stages where it became active, and a bivalent mark within post-germination seedlings where it

was repressed (Fig. 6C). Together, these results indicate that (i) seed DMVs are enriched in TF

genes that are preferentially coated with H3K27me3 and bivalent marks, (ii) genes within DMV

regions undergo modifications in chromatin epigenetic state in the absence of DNA methylation

changes during development, and (iii) seed DMVs have features strikingly similar to their animal

cell counterparts.

Arabidopsis Seeds Also Contain DMV Regions Enriched With Transcription Factor Genes.

We scanned Arabidopsis methylomes from seeds at different developmental stages (globular

stage through dry seed) and leaves from post-germination plants to determine whether DMVs

were a conserved feature of plant genomes (Fig. 7A). We used the same strategy to identify

Arabidopsis seed genomic regions with < 5% and < 0.4% average bulk methylation levels as we

did to uncover soybean seed DMVs (see Materials and Methods) (Fig. 1). Ninety-nine percent

of the DMVs identified during seed development were also shared with leaf DMVs, and, as such,

we refer to all DMVs as seed DMVs (see Materials and Methods).

Approximately 41% of the Arabidopsis genome, or 4,829 regions, consisted of seed

DMVs at the < 5% scanning criterion (Fig. 7B, Table 1, and Dataset S1). These regions had a

bulk cytosine methylation level of 0.24%, on average, compared with 5.8% for the Arabidopsis

genome as a whole (Appendix Fig. S1C and D). Arabidopsis DMV regions averaged 10 kb in

length and were localized primarily on the arms of all chromosomes, similar to what was

observed in soybean (Table 1, Figs. 7C and D). A smaller number of seed DMVs (3,386) were

uncovered using the < 0.4% scanning criterion that identified genomic regions with essentially

no methylation (Table 1).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 11: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

11

Approximately 32% of all Arabidopsis genes (8,710), including 48% of those encoding

TFs (835), were localized within seed DMV regions at the < 5% scanning criterion (Fig. 7E,

Table 1, and Dataset S2). Arabidopsis TF genes were enriched significantly within DMVs (t-

test, P < 0.05), and about one-half overlapped with those present within soybean DMV regions.

GO enrichment analysis of all Arabidopsis seed DMV genes showed that the most significant

GO functional groups were sequence-specific DNA binding TF activity, nucleic acid TF binding

activity, and regulation of transcription (Fig. 7F and Dataset S4), similar to what was observed

for soybean DMVs (Fig. 4). Other developmentally relevant GO functional groups, such as

response to hormones, also showed significant enrichment (Appendix Fig. S5 and Dataset S4).

DMV genes shared between Arabidopsis and soybean showed GO enrichment for many

biological processes related to development, as predicted from the DMV GO analysis of each

plant species separately (Appendix Fig. S6). Finally, large numbers of Arabidopsis genes that

reside within DMVs, including those encoding TFs, were regulated with respect to time and

space during seed development (Appendix Figs. S7 and S8, and Dataset S2). Together, these

results show that Arabidopsis seed DMV regions have features identical to those in soybean

seeds, and that DMVs enriched for TF genes and other genes that carry out important seed

developmental functions are a conserved feature of seed genomes.

Most Seed DMV Regions Are Present Before Fertilization. In both soybean and Arabidopsis,

99% of DMVs uncovered during seed development were conserved within the genomes of

seedlings (soybean) and leaves (Arabidopsis), indicating that the seed DMV methylation status

did not change significantly following germination (see Materials and Methods). We scanned

Arabidopsis sperm, vegetative, and central cell methylomes generated by others for DMVs (17,

18) (Fig. 1), and compared DMVs present in these gametophytic cells with those uncovered from

our Arabidopsis post-mature green and dry seed methylomes (8) to determine whether seed

DMVs were present prior to fertilization (Fig. 7G). We used seed methylomes from the Col-0

ecotype (Fig. 7G) instead of Ws-0 (Fig. 7A-F), because the gametophytic cell methylomes were

from Col-0, and we found that 10% of Ws-0 seed DMVs differed from those in Col-0 at the

same developmental stages due to cytosine polymorphisms (8, 19, 20). Thus, it was important to

carry out within ecotype DMV comparisons.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 12: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

12

The vast majority (86 to 95%) of Arabidopsis seed DMV regions were also present

within female gametophyte (central cell) and male gametophyte (sperm and vegetative cells)

genomes prior to fertilization (Fig. 7G). The small number of seed DMV regions that were not

scored as DMVs in sperm and central cells had average bulk methylation levels > 5%, and,

therefore, higher methylation levels before fertilization and seed development. These results

indicate that most seed DMV regions are highly conserved, and do not change significantly with

respect to methylation status during major periods of the plant life cycle. They are present

within gametophytic cells before fertilization, developing seeds after fertilization, growing

sporophytic seedlings following seed germination, and leaves of the young sporophytic plant.

Many Genes Known to Play Important Roles in Seed Development Are Present Within

Soybean and Arabidopsis DMVs. GO analysis indicated that genes contained within both

soybean and Arabidopsis DMVs were significantly enriched for those that play major regulatory

roles during seed formation (Figs. 4 and 7F, Appendix Figs. S5 and S6). We searched soybean

and Arabidopsis DMVs for genes that were known to play critical roles in seed differentiation

and development (Dataset S5). Many genes essential for embryo formation were localized

within both soybean and Arabidopsis DMVs – including several WUS HOMEOBOX-

CONTAINING (WOX) genes (e.g., WOX1, WOX3, WOX8/9), BABY BOOM, SCARECROW,

SHATTERPROOF1, PLETHORA, TARGET OF MONOPTEROS5, and CUP SHAPED

COTYLEDON family genes (e.g., CUC2, CUC3). Other genes playing major roles in seed

formation, such as CLAVATA3, PIN1, YUCCA4, BODENLOS, and HANABA TARANU, were

also present within both soybean and Arabidopsis DMVs. In addition, several gene classes that

play critical roles in seed physiological processes, such as those encoding storage proteins (e.g.,

GmGlycinin1, AtCruciferin1) and hormone biosynthesis enzymes [e.g., Gibberellic Acid Oxidase

(GmGA20Ox2, GmGA3Ox3, AtGm20Ox2, AtGA3Ox3)] were also localized with both soybean

and Arabidopsis DMVs. Together, these results suggest that many genes playing essential roles

in seed formation are conserved within the DMVs of divergent plant species.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 13: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

13

Discussion

We have shown that there are large portions of soybean and Arabidopsis seed genomes that are

hypomethylated and enriched with TF genes. Seed DMVs resemble in striking ways the DMVs

that are present in the cells of many vertebrate animals, including humans (12-14, 21). DMVs in

both kingdoms are (i) scattered across the genome in thousands of long hypomethylated regions,

(ii) do not vary significantly with respect to their methylation status across diverse

developmental stages, tissues, and cell types, (iii) are enriched significantly with

developmentally important genes, including large numbers of TF genes, (iv) undergo epigenetic

changes at the chromatin level that can accompany changes in gene activity, and (iv) present

within diverse species. Thus, DMVs appear to be a unique feature of eukaryotic genomes that

play important roles in cell differentiation and development in the absence of methylation

changes at the DNA level. It has been proposed that DMVs arose in animal cells as a result of

strong negative selection, or purification, for transposable element insertions that would disrupt

critical regulatory genes and have a deleterious effect on development (22). Whether this is the

case for the seed DMVs uncovered here remains to be determined.

A large number of genes within both soybean and Arabidopsis DMVs are regulated

during seed development, and are expressed within specific seed stages, regions, subregions, or

tissues. These genes include those encoding TFs, as well as others, such as storage protein

genes, which play important roles in seed formation and germination. The remarkable feature of

these genes is that they retain their hypomethylated status within DMVs regardless of their state

of expression. How then are these genes turned on and off in the absence of methylation changes

that have been shown to play, for example, a role in differential activity of maternal and paternal

alleles which is essential for seed endosperm development (7)?

One possibility is that many DMV genes, especially those encoding storage proteins, are

activated and repressed by the action of seed-specific regulators such as LEAFY COTYLEDON1

(LEC1), ABSCISIC ACID INHIBITOR3 (ABI3), and FUSCA3 (FUS3) (23), among others.

Epigenetic changes at the chromatin level within seeds may also partner with the action of TFs to

facilitate development-specific changes in gene activity. Soybean seed DMV genes, like their

animal counterparts (12, 13, 15), are enriched for H3K27me3 and bivalent marks, and the histone

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 14: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

14

marks coated on DMV genes, including H3K4me3, can change in parallel with their gene

activity levels. DMV genes that are repressed, or expressed at low levels, are coated with

repressive H3K27me3 marks, whereas those that are active are marked with either H3K4me3 or

bivalent marks. We assume that the bivalent marks on seed DMV genes resemble those that coat

animal DMV genes, rather than being caused by a mixture of tissues in which the genes are

active and repressed. First, seed DMV genes coated with bivalent marks are enriched with TF

genes, similar to their animal counterparts (24). Second, seed DMV genes that contain bivalent

marks are expressed at significantly lower levels than their counterparts marked with H3K4me3

– which is a feature of animal genes with bivalent marks (24). Finally, seed DMV genes marked

with both H3K27me3 and H3K4me3 were identified from soybean embryos that consist

primarily of parenchyma storage tissue cells that reside within cotyledon regions (Fig 3). Thus,

we favor a model in which the actions of TFs coupled with chromatin-level epigenetic changes

regulate genes within seed DMV regions during development in the absence of DNA

methylation events. In animals, TF genes have been shown to regulate other genes within the

same DMV, and are organized into self-contained chromatin domains (15). Whether that is also

true of seed DMVs remains to be determined.

What about seed genes that are not present within DMVs? These genes reside within

genomic regions with > 5% average bulk methylation at CG, CHG, and/or CHH sites. These

regions contain highly methylated transposable elements, genes with body methylation, or both

(8). Many soybean and Arabidopsis genes outside of DMV regions are also regulated with

respect to space and time during seed development (8). We showed previously that many of

these genes are activated and shut down in the absence of significant methylation changes similar

to genes within DMV regions (8). Whether these genes undergo major epigenetic chromatin

changes during seed development remains to be determined.

A major challenge for the future will be to determine the precise mechanisms by which

genes residing within both DMV and non-DMV regions are regulated during seed development.

This will require identifying the cis-control elements that program seed gene activity, the cognate

TFs that interact with these elements, proteins that drive epigenetic changes at the chromatin

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 15: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

15

level, and how genes expressed during seed development are organized into regulatory networks

that are required to form a seed.

Materials and Methods

General Strategy For Identifying DMVs. Soybean (Glycine max cv Williams 82) and

Arabidopsis (Ws-0 and Col-0) seed and post-germination methylome data were taken from our

previously published BS-Seq experiments (8). Specific details regarding the BS-Seq libraries

used in the experiments reported here, and how they were constructed and characterized, are

contained within the Materials and Methods and SI Materials and Methods reported in Lin et al

(8).

DMVs were identified using the strategy illustrated in Fig. 1 (12). The methylome at

each developmental stage was scanned using a sliding window of 5 kb with smaller 1 kb

incremental steps, and regions without cytosine bases were discarded. The bulk methylation

levels in CG-, CHG-, and CHH-contexts were calculated for all remaining windows (8, 25).

DMVs were defined as genomic regions with either < 5% or < 0.4% bulk methylation levels in

all three cytosine contexts over all developmental stages studied (Table 1). A 0.4% bulk

methylation level was used to identify DMVs with no detectable methylation, as this was the

lowest level of un-methylated cytosines that were not converted by BS to thymine in our

previous experiments (8). Overlapping DMVs were merged and reported as contiguous DMV

regions in Dataset S1. Only genes that contained their entire bodies and 1 kb of 5’ and 3’

flanking regions were designated as genes contained within DMV regions (Dataset S2).

Identification and Characterization of Soybean Seed DMVs. Soybean DMVs were identified

in the methylomes of globular, cotyledon, early-maturation, mid-maturation, late-maturation,

early-pre-dormancy, late-pre-dormancy, and dry seeds, as well as 6 d post-germination seedlings

and seedling cotyledons (Fig. 2). Ninety-nine percent of the DMVs identified during seed

development was also present as DMVs in post-germination seedlings and seedling cotyledons.

Thus, we refer collectively to these DMVs as soybean seed DMVs (Fig. 2 and Dataset S1).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 16: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

16

GO term enrichment analysis of seed DMV genes. The (i) GOSeqR Bioconductor package

(26), (ii) SoyBase GO annotations (https://soybase.org/genomeannotation/index.php), (iii)

hypergeometric statistical test method, and (iv) a FDR < 0.05 (Benjamini-Hochberg multiple

testing correction) were used to identify GO terms enriched in soybean DMV genes (Fig. 4 and

Dataset S3) (23).

Expression analysis of seed DMV genes. dChip (27) was used to generate heat maps of soybean

up-regulated DMV mRNA levels in whole seeds (Fig. 5) and in specific seed regions, sub-

regions, and tissues (Appendix Fig. S3) throughout development. Whole seed RNA-Seq data

were taken from our previously published experiments (8) (GenBank Accession GSE29163).

RNA-Seq data for specific seed regions, sub-regions, and tissues were taken from the Harada-

Goldberg LCM datasets (GeneBank AccessionsGSE116036) (23). EdgeR was used to identify

DMV mRNAs that were up-regulated > 5-fold [FDR < 0.001 (Benjamini-Hochberg multiple

testing correction)] in specific seed stages, regions, sub-regions, and tissues (23).

Identification of histone marks on DMV genes. Embryos at different developmental stages

were used to characterize DMV genes for H3K4me3 and H3K27me3 histone marks (Fig. 6).

ChIP assays were carried out using anti-H3K4me3, anti-H3K27me3, and anti-H3 antibodies

according to our previously published protocol (23). ChIP-Seq library construction, DNA

sequencing analysis, and peak calls were carried out as described previously (23). H3 ChIP-Seq

results were used as a control to filter out noise during peak calling. MACS2 and SICER were

used to call H3K4me3 and H3K27me3 peaks, respectively (28-30). DMV genes were

considered marked with H3K4me3, H3K27me3, or both H3K4me3 and H3K27me3 (bivalent

marks) if peaks had a P value of < 0.05, and were associated with the gene body and/or 1 kb of

upstream region (23). All ChIP-Seq data reported in this paper were deposited in GenBank as

accession GSE114879.

Identification and Characterization of Arabidopsis Seed DMVs. Arabidopsis (Ws-0) DMVs

were identified in the methylomes of globular, linear cotyledon, mature green, post-mature

green, and dry seeds, as well as leaves from four week-old plants (8) (Fig. 7). Ninety-nine

percent of the DMVs identified during seed development was also present as DMVs in leaves.

Thus, we refer collectively to these DMVs as Arabidopsis seed DMVs (Fig. 7 and Dataset S1).

Arabidopsis (Col-0) methylomes from post-mature green and dry seeds (8), and those from

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 17: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

17

sperm, vegetative, and central cells (17, 18) were used to compare DMVs from seeds and

gametophytic cells.

GO term enrichment analysis. Virtual Plant 1.3 (31) was used for Arabidopsis gene GO

enrichment analysis with a cutoff FDR value < 0.05 (Benjamini-Hochberg multiple testing

correction) (Fig. 7 and Appendix Fig. S5 and S6).

Gene expression analysis. dChip (27) was used to generate heat maps of Arabidopsis DMV

mRNA levels in whole seeds (Fig. S7) and in specific seed regions and subregions (Appendix

Fig. S8) during development. Arabidopsis seed transcriptome data were taken from our

previously published GeneChip experiments [whole seeds (GeneBank Accession GSE680) (32);

regions and subregions GeneBank Accession GSE12404 (4)].

ACKNOWLEDGEMENTS

This work was supported by a grant from the National Science Foundation Plant Genome

Program (to R.B.G., M.P., and J.J.H.).

References

1. Goldberg RB, de Paiva G, & Yadegari R (1994) Plant Embryogenesis: Zygote to Seed. Science 266(5185):606-614.

2. Hands P, Rabiger DS, & Koltunow A (2016) Mechanisms of endosperm initiation. Plant Reprod 29(3):215-225.

3. Weijers D (2014) Genetic control of identity and growth in the early Arabidopsis embryo. Biochem Soc Trans 42(2):346-351.

4. Belmonte MF, et al. (2013) Comprehensive developmental profiles of gene activity in regions and subregions of the Arabidopsis seed. Proc Natl Acad Sci U S A 110(5):E435-444.

5. Bauer MJ & Fischer RL (2011) Genome demethylation and imprinting in the endosperm. in Curr Opin Plant Biol (Elsevier Ltd), pp 162-167.

6. Gehring M & Satyaki PR (2017) Endosperm and Imprinting, Inextricably Linked. in Plant Physiol, pp 143-154.

7. Satyaki PR & Gehring M (2017) DNA methylation and imprinting in plants: machinery and mechanisms. Crit Rev Biochem Mol Biol 52(2):163-175.

8. Lin JY, et al. (2017) Similarity between soybean and Arabidopsis seed methylomes and loss of non-CG methylation does not affect seed development. Proc Natl Acad Sci U S A 114(45):E9730-E9739.

9. An Y-QC, et al. (2017) Dynamic Changes of Genome-Wide DNA Methylation during Soybean Seed Development. in Sci. Rep. (Springer US), pp 1-14.

10. Bouyer D, et al. (2017) DNA methylation dynamics during early plant life. Genome Biol 18(1):179.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 18: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

18

11. Kawakatsu T, Nery JR, Castanon R, & Ecker JR (2017) Dynamic DNA methylation reconfiguration during seed development and germination. Genome Biol 18(1):171.

12. Xie W, et al. (2013) Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153(5):1134-1148.

13. Jeong M, et al. (2014) Large conserved domains of low DNA methylation maintained by Dnmt3a. Nat Genet 46(1):17-23.

14. Long HK, et al. (2013) Epigenetic conservation at gene regulatory elements revealed by non-methylated DNA profiling in seven vertebrates. Elife 2:e00348.

15. Li Y, et al. (2018) Genome-wide analyses reveal a role of Polycomb in promoting hypomethylation of DNA methylation valleys. Genome Biol 19(1):18.

16. Goldberg RB, Hoschek G, Tam SH, Ditta GS, & Breidenbach RW (1981) Abundance, Diversity, and Regulation of mRNA Sequence Sets in Soyeab Embryogenesis. Developmental Biology 83:201-217.

17. Hsieh PH, et al. (2016) Arabidopsis male sexual lineage exhibits more robust maintenance of CG methylation than somatic tissues. Proc Natl Acad Sci U S A 113(52):15132-15137.

18. Park K, et al. (2016) DNA demethylation is initiated in the central cells of Arabidopsis and rice. Proc Natl Acad Sci U S A 113(52):15138-15143.

19. Kawakatsu T, et al. (2016) Epigenomic Diversity in a Global Collection of Arabidopsis thaliana Accessions. Cell 166(2):492-505.

20. Vaughn MW, et al. (2007) Epigenetic Natural Variation in Arabidopsis thaliana. PLoS Biol 5(7):e174.

21. Stadler MB, et al. (2011) DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature 480(7378):490-495.

22. Simons C, Makunin IV, Pheasant M, & Mattick JS (2007) Maintenance of transposon-free regions throughout vertebrate evolution. BMC Genomics 8:470.

23. Pelletier JM, et al. (2017) LEC1 sequentially regulates the transcription of genes involved in diverse developmental processes during seed development. Proc Natl Acad Sci U S A 114(32):E6710-E6719.

24. Bernstein BE, et al. (2006) A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125(2):315-326.

25. Hsieh TF, et al. (2009) Genome-wide demethylation of Arabidopsis endosperm. Science 324(5933):1451-1454.

26. Young MD, Wakefield MJ, Smyth GK, & Oshlack A (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 11(2):R14.

27. Li C & Wong WH (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2(8):research0032.0031-0032.0011.

28. Zhang Y, et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9):R137.

29. Zang C, et al. (2009) A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics 25(15):1952-1958.

30. Bailey T, et al. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol 9(11):e1003326.

31. Katari MS, et al. (2010) VirtualPlant: a software platform to support systems biology research. Plant Physiol 152(2):500-515.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 19: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

19

32. Le BH, et al. (2010) Global analysis of gene activity during Arabidopsis seed development and identification of seed-specific transcription factors. Proc Natl Acad Sci U S A 107(18):8063-8070.

33. Robinson MD, McCarthy DJ, & Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139-140.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 20: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

20

Table 1. Summary of Soybean and Arabidopsis DMV Characteristics Soybean Arabidopsis < 5%* < 0.4%* < 5%* < 0.4%*

Number of DMVs 21,669 14,558 4,829 3,386 DMV Genome Size 210 Mb 112 Mb 49 Mb 27 Mb Average DMV Length 10 ± 6 kb 8 ± 3 kb 10 ± 6 kb 8 ± 4 kb Maximum DMV Length 76 kb 41 kb 137 kb 39 kb Number of DMV Genes 14,328 6,377 8,710 3,736 Number of DMV TF Genes 1,721 924 835 457

* Maximum DMV bulk methylation level in CG-, CHG-, and CHH-contexts across all developmental stages. † Mean length +/- standard deviation.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 21: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 1. A strategy to identify seed DNA Methylation Valleys (DMVs). Seed methylomes (8) were scanned across the genome at each stage of development (arrows) using a 5 kb sliding window with smaller 1 kb incremental steps (see Materials and Methods) (dark bracketed lines). The bulk methylation levels in CG-, CHH-, and CHG-contexts were calculated for each window at every developmental stage (8). Genomic regions with bulk methylation levels of < 5% or < 0.4% across all developmental stages were designated as DMVs, and overlapping DMVs were merged to define DMV regions (red line).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 22: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 2. Identification of soybean seed DMVs with < 5% bulk methylation level. (A) Seed and post-germination methylomes used to identify DMVs (8). glob, cot, em, mm, lm, pd1, and pd2 refer to globular, cotyledon, early-maturation, mid-maturation, late-maturation, early pre-dormancy, and late pre-dormancy seed developmental stages, respectively. sdlg and sdlg-COT refer to 6 day post-germination seedling and seedling cotyledon, respectively. Seed images are not drawn to scale. (B) Percentage of seed DMVs with < 5% bulk methylation in the soybean genome (see Materials and Methods). (C) Box plot of seed DMV lengths. Horizontal bar represents a median length of 8 kb. (D) Percentages of seed DMVs in chromosomal pericentromeric and arm regions. (E) Genome browser view of a 28 kb DMV located on chromosome 3. Genes in red color (Glyma03g39410, EPOXIDE HYDROLASE; Glyma03g39420, 60S RIBOSOMAL PROTEIN L18-3; Glyma03g39440, SERINE/THREONINE PROTEIN PHOSPHATASE 2A; Glyma03g39450, DORMANCY/AUXIN ASSOCIATED PROTEIN) are located within this DMV, including 1 kb of 5’ and 3’ flanking regions. Genes in gray color are either located partially or outside this DMV region.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 23: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 3. Soybean seed DMVs are present in specific seed regions and tissues. (A) Paraffin section of a cotyledon-stage seed. AX, COTL, and SC refer to axis, cotyledon, and seed coat, respectively. (B) Whole-mount photographs of embryo and SC from early-maturation and mid-maturation stage seeds. (C) Hand drawn colorized cross-section of an early-maturation stage seed. The expanded red box represents an enlarged SC region. COTL-AB-PY, COTL-AD-PY, SC-PA, and SC-PY refer to cotyledon abaxial parenchyma (dark green), cotyledon adaxial parenchyma (light green), seed coat parenchyma (light blue), and seed coat palisade tissues (darkblue), respectively. (D) Percentage of seed DMVs that overlap with specific seed region and tissue DMVs. Scale bars indicate 100 um (A) and 1 mm (B).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 24: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 4. Biological process gene ontology (GO) terms that are enriched in soybean seed DMV genes. (A) Proportion of seed DMV genes and transcription factor (TF) genes with < 5% bulk methylation in the soybean genome. (B) Enriched GO terms with a Log10 False Discovery Rate (FDR) < 0.05, which are also listed in Dataset S3 (see Materials and Methods). (C) Venn diagram showing the number of DMV TF genes that are unique to each of the five GO term biological function groups.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 25: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 5. Expression profiles of soybean DMV genes that are up-regulated > 5-fold during seed development. Heat maps of up-regulated seed DMV genes (A) and TF genes (B) generated by edgeR analysis (33) of soybean RNA-seq whole seed datasets (GSE29163) (8, 23) (see Materials and Methods). The number of up-regulated DMV genes and TF genes in each stage is listed on the right side of the heat maps and listed in Dataset S2. Seed images are not drawn to scale. (C) Methylome and RNA-Seq genome browser views of three > 5-fold up-regulated stage-specific TF genes [GmWOX9a, WUSCHEL RELATED HOMEOBOX 9A; GmSPCH, SPEECHLESS; and GmTOM3, TARGET OF MONOPTEROS 3 (red gene models)]. Red triangles mark all genes within the DMVs, including those that are not seed stage-specific. Seed stage, region, and tissue abbreviations are defined in the legend to Figs. 2 and 3. TEs, transposable elements.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 26: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 6. DMV genes with histone marks across soybean seed development. (A) Percentage of DMV genes and all genes in the genome marked with H3K4me3, H3K27me3, H3K4me3 and H3K27me3 (bivalent mark), or no mark at each developmental stage (GenBank Accession GSE114879). (B) Percentage of DMV TF genes and all TF genes in the genome marked with H3K4me3, H3K27me3, bivalent mark, or no mark at each developmental stage (C) Genome browser views showing histone marks and RNA-Seq levels for specific DMV genes during seed development and germination. GmLEA (LATE EMBRYO ABUNDANT), GmABI3 (ABSCISIC ACID INHIBITOR3), GmAIL5-1 (AINTEGUMENTA LIKE5-1).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 27: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. 7. Identification of Arabidopsis seed DMVs with < 5% bulk methylation level. (A) Arabidopsis methylomes used to identify DMVs (8). Images are not drawn to scale. (B) Proportion of seed DMVs with < 5% bulk methylation level in the Arabidopsis genome. (C) Box plot of seed DMV lengths. Horizontal bar represents a median length of 8 kb. (D) Proportion of seed DMVs in chromosomal pericentromeric and arm regions. (E) Proportion of seed DMV genes and DMV TF genes with < 5% bulk methylation in the Arabidopsis genome. (F) Bar plots of the most significantly enriched GO terms of Arabidopsis seed DMV genes, which are also listed in Dataset S4. Log10 P-value refers to FDR. GO terms related to transcription factor activity are highlighted in red. (G) Percentages of seed DMVs that overlap with sperm cell, vegetative cell, and central cell DMVs.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 28: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

1

Supplementary Information for Seed Genome Hypomethylated Regions Are Enriched In Transcription Factor Genes Min Chen, Jer-Young Lin, Jungim Hur, Julie M. Pelletier, Russell Baden, Matteo Pellegrini, John J. Harada, and Robert B. Goldberg

Robert B. Goldberg Email: [email protected] This PDF file includes:

Figs. S1 to S8 References for SI reference citations

Other supplementary materials for this manuscript include the following:

Datasets S1 to S5

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 29: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S1. Bulk methylation levels of soybean and Arabidopsis DMVs. Average bulk methylation levels were calculated as described previously (see Materials and Methods) (1).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 30: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S2. Genome browser view of DMVs, genes, and transposable elements (TEs) across all 20 soybean chromosomes.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 31: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S3. Expression profiles of soybean DMV genes that are up-regulated in specific seed regions, sub-regions, and tissues during development. (A) Percentage of DMV genes (14,328) that are up-regulated > 5-fold in specific seed regions, sub-regions, or tissues during development based on edgeR analysis (2) of the Harada-Goldberg LCM soybean seed RNA-seq datasets (GSE116036). (3). (B) Percentage of all > 5-fold up-regulated seed region-, sub-region-, and tissue-specific genes (7,400) that are in DMVs. (C) Heat maps of DMV genes that are up-regulated > 5-fold in specific seed regions, sub-regions, or tissues during development [red sectors in (A) and (B)]. Expression profile (D) of a TRIHELIX DNA BINDING PROTEIN TF gene that is specific for the SC palisade layer (Fig. 3C), and the genome browser view (E) of its DMV region. The bulk methylation level of this DMV was < 0.01%. Seed images are not drawn to scale. g and glob, globular stage; c and cot, cotyledon stage; h, heart stage; em, early-maturation stage; mm, mid-maturation stage; lm, late-maturation stage; pd1, early pre-dormancy stage; pd2, late pre-dormancy stage; sdlg, seedling; COTL, cotyledon; SC, seed coat; AB, abaxial; AD, adaxial; PY, parenchyma; PA, palisade; TE, transposable element.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 32: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S4. Box plots of RNA-Seq counts for DMV genes and all genes in the soybean genome marked with H3K4me3, H3K27me3, and bivalent marks during seed development. RNA levels for genes with bivalent and H3K27me3 marks differed significantly from the RNA levels of H3K4me3 marked genes (t-test, P < 0.001).

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 33: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S5. Biological process GO terms that are enriched in Arabidopsis seed DMV genes. Enriched GO terms with FDR < 0.05 are also listed in Dataset S4 (see Materials and Methods). Venn diagram shows the number of DMV TF genes that are unique to GO term biological function groups.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 34: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S6. Biological process GO terms related to development that are enriched in seed DMV genes shared between Arabidopsis and soybean. Arabidopsis gene ID numbers were obtained for DMV genes shared between soybean and Arabidopsis. GO enrichment analysis with a cutoff FDR value < 0.05 was carried out as described for Arabidopsis genes in Materials and Methods.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 35: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S7. Expression profiles of Arabidopsis DMV genes at different stages of seed development. Hierarchical clustering of Arabidopsis DMV mRNAs (A) and DMV transcription factor mRNAs (TFs) (B) during seed development using dChip and our previously published Arabidopsis whole seed mRNA GeneChip data (4). Only 6,682 of the 8,710 Arabidopsis DMV genes (Table 1) are represented on the ATH1 22k GeneChip used to obtain these expression profiles (4). The number of DMV genes represented in stage-specific RNA clusters is listed on the right side of each figure. Seed images are not drawn to scale.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 36: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Fig. S8. Expression profiles of Arabidopsis DMV genes that are up-regulated in specific seed regions and subregions during development. (A) Percentage of DMV genes (6,682) that are up-regulated > 5-fold in specific seed regions and subregions at different developmental stages based on our previously published Arabidopsis seed LCM mRNA data using the ATH1 22k GeneChip (5). (B) Heat maps of DMV mRNAs and those encoding transcription factors (TFs) that are up-regulated > 5-fold in specific seed regions and subregions during development. Seed images are not drawn to scale.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 37: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

26

References

1. Lin JY, et al. (2017) Similarity between soybean and Arabidopsis seed methylomes and

loss of non-CG methylation does not affect seed development. Proc Natl Acad Sci U S A 114(45):E9730-E9739.

2. Robinson MD, McCarthy DJ, & Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139-140.

3. Pelletier JM, et al. (2017) LEC1 sequentially regulates the transcription of genes involved in diverse developmental processes during seed development. Proc Natl Acad Sci U S A 114(32):E6710-E6719.

4. Le BH, et al. (2010) Global analysis of gene activity during Arabidopsis seed development and identification of seed-specific transcription factors. Proc Natl Acad Sci U S A 107(18):8063-8070.

5. Belmonte MF, et al. (2013) Comprehensive developmental profiles of gene activity in regions and subregions of the Arabidopsis seed. Proc Natl Acad Sci U S A 110(5):E435-444.

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 38: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Dataset S1. Soybean and Arabidopsis DMVs

* DMVs were identified using the strategy illustrated in Fig. 1. A 5 kb sliding window with 1 kb steps were used for both soybean and Arabidopsis.

Click the buttons to see the coordinates of DMVs.

Soybean Seed DMVs (CG, CHG & CHH <5%)

Soybean Seed DMVs (CG, CHG & CHH <0.4%)

Arabidopsis Seed DMVs (CG, CHG & CHH <5%)

Arabidopsis Seed DMVs (CG, CHG & CHH <0.4%)

Soybean Seed Coat DMVs (CG, CHG & CHH <5%)

Soybean Embryoic Axis DMVs (CG, CHG & CHH <5%)

Soybean Embryoic Cotyledon DMVs (CG, CHG & CHH <5%)

Soybean Cotyledon Abaxial Parenchyma DMVs (CG, CHG & CHH <5%)

Soybean Cotyledon Adaxial Parenchyma DMVs (CG, CHG & CHH <5%)

Soybean Seed Coat Palisade DMVs (CG, CHG & CHH <5%)

Soybean Seed Coat Parenchyma DMVs (CG, CHG & CHH <5%)

Arabidopsis Sperm Cell DMVs (CG, CHG & CHH <5%)

Arabidopsis Vegetative Cell DMVs (CG, CHG & CHH <5%)

Arabidopsis Central Cell DMVs (CG, CHG & CHH <5%)

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 39: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Dataset S2. Soybean and Arabidopsis DMV Genes* Soybean DMV genes are the genes completely located in soybean seed & seedling DMVs, including their entire bodies and 1 kb of 5’ and 3’ flanking regions.

† Arabidopsis DMV genes are the genes completely located in Arabidopsis seed & leaf DMVs, including their entire bodies and 1 kb of 5’ and 3’ flanking regions.

Soybean DMV Genes* Arabidopsis DMV Genes†

Soybean Stage-Specific DMV Genes (Fig. 5)

Soybean Region-Specific DMV Genes (Fig. S3)

Arabidopsis Stage-Specific DMV Genes (Fig. S7)

Arabidopsis Region-Specific DMV Genes (Fig. S8)

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 40: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Ontology Category

Enriched GO Term ID DMV Genes in Enriched GO Term

Genes in Genome FDR GO Description Groups Used in Fig. 4‡

GO:0006355 1576 4035 4.9E-80 regulation of transcription, DNA-dependent Regulation of gene expressionGO:0045892 94 356 3.6E-03 negative regulation of transcription, DNA-dependent Regulation of gene expressionGO:0009741 99 206 6.5E-09 response to brassinosteroid stimulus Response to hormoneGO:0009751 127 285 7.4E-09 response to salicylic acid stimulus Response to hormoneGO:0009753 184 456 1.4E-08 response to jasmonic acid stimulus Response to hormoneGO:0009867 156 374 2.0E-08 jasmonic acid mediated signaling pathway Response to hormoneGO:0009733 327 916 5.5E-08 response to auxin stimulus Response to hormoneGO:0009739 108 240 7.0E-08 response to gibberellin stimulus Response to hormoneGO:0009862 148 363 2.7E-07 systemic acquired resistance, salicylic acid mediated signaling pathway Response to hormoneGO:0009737 365 1069 1.3E-06 response to abscisic acid stimulus Response to hormoneGO:0009873 79 182 4.5E-05 ethylene mediated signaling pathway Response to hormoneGO:0009723 215 634 1.1E-03 response to ethylene stimulus Response to hormoneGO:0009740 73 186 4.0E-03 gibberellic acid mediated signaling pathway Response to hormoneGO:0010200 337 738 1.4E-26 response to chitin Response to stimulusGO:0002679 200 408 1.1E-19 respiratory burst involved in defense response Response to stimulusGO:0010167 195 452 5.0E-12 response to nitrate Response to stimulusGO:0009611 354 942 1.2E-11 response to wounding Response to stimulusGO:0009414 296 831 3.7E-07 response to water deprivation Response to stimulusGO:0009612 69 141 1.1E-06 response to mechanical stimulus Response to stimulusGO:0006979 224 613 2.3E-06 response to oxidative stress Response to stimulusGO:0080167 113 285 6.1E-05 response to karrikin Response to stimulusGO:0010106 91 223 1.5E-04 cellular response to iron ion starvation Response to stimulusGO:0010583 139 384 1.0E-03 response to cyclopentenone Response to stimulusGO:0009595 82 212 3.3E-03 detection of biotic stimulus Response to stimulusGO:0010363 138 395 4.8E-03 regulation of plant-type hypersensitive response Response to stimulusGO:0035865 12 17 6.9E-03 cellular response to potassium ion Response to stimulusGO:0007275 107 234 3.1E-08 multicellular organismal development Developmental processGO:0007389 76 178 1.5E-04 pattern specification process Developmental processGO:0048653 48 105 1.1E-03 anther development Developmental processGO:0010588 20 32 1.2E-03 cotyledon vascular tissue pattern formation Developmental processGO:0010500 13 17 1.3E-03 transmitting tissue development Developmental processGO:0006949 32 63 1.7E-03 syncytium formation Developmental processGO:0010160 9 10 2.5E-03 formation of organ boundary Developmental processGO:0009956 17 27 3.5E-03 radial pattern formation Developmental processGO:0010093 30 60 3.7E-03 specification of floral organ identity Developmental processGO:0009860 149 429 3.9E-03 pollen tube growth Developmental processGO:0010089 75 194 5.4E-03 xylem development Developmental processGO:0010440 40 91 8.6E-03 stomatal lineage progression Developmental processGO:0015706 197 456 3.9E-12 nitrate transport TransportGO:0006612 369 1025 2.1E-09 protein targeting to membrane TransportGO:0030001 104 230 9.8E-08 metal ion transport TransportGO:0006865 171 437 5.5E-07 amino acid transport TransportGO:0000041 88 194 1.2E-06 transition metal ion transport TransportGO:0006826 100 237 9.5E-06 iron ion transport TransportGO:0015840 11 14 3.4E-03 urea transport TransportGO:0070838 65 161 3.7E-03 divalent metal ion transport TransportGO:0015698 18 30 4.7E-03 inorganic anion transport TransportGO:0009693 115 242 5.6E-10 ethylene biosynthetic process Cellular processGO:2000652 27 39 2.8E-06 regulation of secondary cell wall biogenesis Cellular processGO:0009828 45 82 4.4E-06 plant-type cell wall loosening Cellular processGO:0000165 207 570 1.1E-05 MAPK cascade Cellular processGO:0035556 112 276 1.9E-05 intracellular signal transduction Cellular processGO:0009831 32 57 1.5E-04 plant-type cell wall modification involved in multidimensional cell growth Cellular processGO:0009657 73 177 8.9E-04 plastid organization Cellular processGO:0008361 58 134 1.1E-03 regulation of cell size Cellular processGO:0042545 100 262 1.2E-03 cell wall modification Cellular processGO:0010310 75 188 2.1E-03 regulation of hydrogen peroxide metabolic process Cellular processGO:0044036 69 178 8.2E-03 cell wall macromolecule metabolic process Cellular processGO:0000080 16 21 1.9E-04 G1 phase of mitotic cell cycle OthersGO:0045736 27 50 1.7E-03 negative regulation of cyclin-dependent protein kinase activity OthersGO:0043086 54 130 5.4E-03 negative regulation of catalytic activity OthersGO:0051865 17 28 5.6E-03 protein autoubiquitination OthersGO:0000271 99 271 6.4E-03 polysaccharide biosynthetic process OthersGO:0003700 1712 3813 3.8E-140 sequence-specific DNA binding transcription factor activityGO:0043565 385 878 2.8E-26 sequence-specific DNA bindingGO:0003677 1328 3882 4.9E-26 DNA bindingGO:0004601 109 240 9.1E-08 peroxidase activityGO:0020037 274 745 1.2E-07 heme bindingGO:0015250 51 89 2.5E-07 water channel activityGO:0045735 68 135 7.1E-07 nutrient reservoir activityGO:0030599 87 194 6.0E-06 pectinesterase activityGO:0004857 75 163 1.3E-05 enzyme inhibitor activityGO:0004553 205 560 1.3E-05 hydrolase activity, hydrolyzing O-glycosyl compoundsGO:0009055 330 969 1.3E-05 electron carrier activityGO:0003674 2791 9893 1.0E-03 molecular functionGO:0009815 18 28 4.3E-03 1-aminocyclopropane-1-carboxylate oxidase activityGO:0000976 21 36 6.8E-03 transcription regulatory region sequence-specific DNA bindingGO:0042802 111 302 6.8E-03 identical protein bindingGO:0019825 150 430 7.8E-03 oxygen bindingGO:0015035 78 200 7.8E-03 protein disulfide oxidoreductase activityGO:0005576 1445 4126 3.4E-34 extracellular regionGO:0031225 189 436 3.6E-12 anchored to membraneGO:0005575 452 1279 1.3E-10 cellular_componentGO:0009505 261 681 7.3E-10 plant-type cell wallGO:0005618 475 1429 4.4E-07 cell wallGO:0048046 318 927 4.6E-06 apoplast

† The annotation of soybean transcription factor genes used in this dataset was obtained from Plant Transcription Factor Databse (PlantTFDB v3.0, http://planttfdb_v3.cbi.pku.edu.cn/index.php?sp=Gma).

Not included in Fig.4

Not included in Fig. 4

Dataset S3. GO Enrichment Analysis* of Soybean DMV Genes, Including Transcription Factor Genes† (TFs)

* Soybean GO enrichment analysis was carried out using GoSeq R-pacakage [(http://www.bioconductor.org/packages/release/bioc/html/goseq.html) (FDR<0.05 with Benjamini and Hochberg Correction)]. The soybean ontology annotatation used for the GO enrichment analysis was download from Soybase (https://www.soybase.org/genomeannotation/). Only enriched GO terms with FDR<0.01 are listed in this dataset.

‡ Groups were used in Fig. 4 for enriched biological process GO terms. In the same group, the enriched GO terms were sorted based on FDR.

Click"Enriched GO Term ID" to access DMV genes in the enriched Biological Process GO terms.

Bio

logi

cal P

roce

ssM

olec

ular

Fun

ctio

nC

ellu

lar

Com

pone

nt

This dataset represents the index of all enriched GO terms. Click the buttons below to view enriched GO terms in each Ontology Category.

Enriched GO Terms in Biological Process

Enriched GO Terms in Molecular Function

Enriched GO Terms in Cellular Component

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 41: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

Ontology Category

Enriched GO Term ID DMV Genes in Enriched GO Term

All Genes in the GO Term

P-value GO Description Groups Used in Fig. S5‡

GO:0045449 732 1621 3.2E-17 regulation of transcriptionGO:0010556 746 1667 3.2E-17 regulation of macromolecule biosynthetic processGO:2000112 746 1667 3.2E-17 regulation of cellular macromolecule biosynthetic processGO:0009889 762 1715 3.2E-17 regulation of biosynthetic processGO:0031326 760 1710 3.2E-17 regulation of cellular biosynthetic processGO:0019219 742 1696 1.4E-15 regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic processGO:0080090 777 1795 1.4E-15 regulation of primary metabolic processGO:0051171 748 1717 1.8E-15 regulation of nitrogen compound metabolic processGO:0031323 794 1859 5.7E-15 regulation of cellular metabolic processGO:0010468 748 1799 5.0E-12 regulation of gene expressionGO:0060255 762 1848 1.0E-11 regulation of macromolecule metabolic processGO:0019222 829 2085 3.5E-10 regulation of metabolic processGO:0050794 1142 3050 2.7E-09 regulation of cellular processGO:0006355 374 883 2.5E-06 regulation of transcription, DNA-dependentGO:0051252 375 892 4.2E-06 regulation of RNA metabolic processGO:0050789 1216 3434 1.2E-05 regulation of biological processGO:0065007 1287 3661 1.5E-05 biological regulationGO:0009725 379 849 2.0E-08 response to hormone stimulusGO:0009739 67 112 9.5E-04 response to gibberellin stimulusGO:0009753 82 158 4.8E-03 response to jasmonic acid stimulusGO:0009733 128 287 1.0E-02 response to auxin stimulusGO:0010033 497 1148 1.3E-09 response to organic substanceGO:0009719 412 920 2.6E-09 response to endogenous stimulusGO:0042221 742 1892 1.4E-06 response to chemical stimulusGO:0050896 1290 3689 3.4E-05 response to stimulusGO:0010200 75 127 4.5E-04 response to chitinGO:0080167 74 127 7.0E-04 response to karrikinGO:0009628 505 1360 1.9E-03 response to abiotic stimulusGO:0009743 94 203 3.0E-02 response to carbohydrate stimulusGO:0019748 184 364 1.7E-06 secondary metabolic processGO:0009698 89 152 8.9E-05 phenylpropanoid metabolic processGO:0009699 74 121 2.1E-04 phenylpropanoid biosynthetic processGO:0055114 312 793 2.8E-03 oxidation-reduction processGO:0006725 143 322 6.3E-03 cellular aromatic compound metabolic processGO:0009827 34 47 8.8E-03 plant-type cell wall modificationGO:0009828 28 35 1.0E-02 plant-type cell wall looseningGO:0019438 101 218 2.0E-02 aromatic compound biosynthetic processGO:0001071 833 1678 2.1E-29 nucleic acid binding transcription factor activityGO:0003700 833 1678 2.1E-29 sequence-specific DNA binding transcription factor activityGO:0016491 588 1460 9.2E-07 oxidoreductase activityGO:0003677 594 1526 2.5E-05 DNA bindingGO:0004497 156 303 2.5E-05 monooxygenase activityGO:0020037 150 293 4.7E-05 heme bindingGO:0005506 174 357 5.5E-05 iron ion bindingGO:0046906 160 326 1.0E-04 tetrapyrrole bindingGO:0019825 122 231 1.2E-04 oxygen bindingGO:0009055 176 383 5.5E-04 electron carrier activityGO:0004091 158 349 2.7E-03 carboxylesterase activityGO:0004857 90 182 1.0E-02 enzyme inhibitor activityGO:0012505 1403 3499 9.4E-21 endomembrane systemGO:0005576 201 413 3.8E-06 extracellular regionGO:0048046 150 324 1.2E-03 apoplastGO:0031225 116 247 5.3E-03 anchored to membraneGO:0005618 224 553 5.7E-03 cell wallGO:0030312 226 557 5.7E-03 external encapsulating structure

Mol

ecul

ar F

unct

ion

Cel

lula

r C

ompo

nent

Dataset S4. GO Enrichment Analysis∗ of Arabidopsis DMV Genes, Including Transcription Factor Genes† (TFs)

* Arabidopsis GO enrichment analysis was carried out using VituralPlant 1.3 [(http://virtualplant.bio.nyu.edu/cgi-bin/vpweb/) (p-value<0.05, Fisher Exact Test with FDR correction)]

‡ Groups were used in Fig. S5 for enriched biological process GO terms.This dataset represents the index of all enriched GO terms. Click the buttons below to view enriched GO terms in each Ontology Category.

Not included in Fig.S5

Not included in Fig.S5

Response to stimulus

† The annotation of Arabidopsis transcription factor genes used in this dataset is obtained from Arabidopsis Transcription Factor Database (AtTFDB, http://agris-knowledgebase.org/AtTFDB/).

Biological regulation

Bio

logi

cal P

roce

ss

Response to hormone

Others

Click"Enriched GO Term ID" to access DMV genes in the enriched Biological Process GO terms.

Enriched GO Terms in Biological Process

Enriched GO Terms in Molecular Function

Enriched GO Terms in Cellular Component

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint

Page 42: Classification: Biological Sciences – Plant Biology · soybean and Arabidopsis seed genomes from post-fertilization through dormancy and germination for regions that contain < 5%

* Seed genes are published in the listed publications below.

List of Reference Publications:

† The annotation of soybean transcription factors used in this dataset was obtained from Plant Transcription Factor Databse (PlantTFDB v3.0, http://planttfdb_v3.cbi.pku.edu.cn/index.php?sp=Gma).

‡ The annotation of Arabidopsis transcription factor genes used in this dataset is obtained from Arabidopsis Transcription Factor Database (AtTFDB, http://agris-knowledgebase.org/AtTFDB/).

Dataset S5. Soybean and Arabidopsis Seed Genes*

1. Dolf Weijers, Genetic control of identity and growth in the early Arabidopsis embryo, Biochemical Society Transactions (2014), 42:346-3512. Steffan Lau, et. Al., Early Embryogenesis in Flowering Plants: Setting up the basic body pattern, Annu. Rev. Plant Biol., 2012, 63:482-5063. Jos R. Wendrich, Dolf Weijers, The Arabidopsis embryo as a miniature morphogenesis model, New Phytologist, 2013, 199:14-254. Philip Hands, et. Al., Mechamisms of endosperm initiation, Plant Reprod., 2016, 29:215-2255. Alice Pajoro, et. Al., The revolution of gene regulatory networks controlling Arabidopsis plant reproduction: a two-decade history, J. of Experimental Botany, 2014, 65:4731-47456. Deirdre Khan, et. Al., Predicting tra nscriptional circuitry underlying seed coat development, Plant Science, 2014, 223:146-1527. Gregorio Orozco-Arroyo, et. Al., Networks controlling seed size in Arabidopsis, Plant Reprod. 2015, 28:17-328. Helen North, et. Al., Arabidopsis seed secrets unravelled after a decade of genetic and omics-drivin research, the plant journal, 2010, 61:971-9819. Katrin Ehlers, et. Al., The MADS box genes ABS, SHP1, and SHP2 are essential for the coordination of cell divisions in ovule and seed coat development and for endosperm formation in Arabidopsis thaliana, PLoS ONE 11(10): e016507510. Olivier Coen, et. Al., Developmental patterning of the sub-epidermal integument cell layer in Arabidopsis seeds, Development, 2017, 144: 1490-149711. Michael Riefler, et. Al., Arabidpsis Cytokinin receptor mutants reveal functions in shoot growth, leaf senescence, seed size, germination, root development, and cytokinin metabolism, the plant cell, 2006, 18:40-5412. Nubia B. Eloy, et. Al., SAMBA, a plant-specific anaphase-promoting complex/cyclosome regulator is involved in early development and A-type cyclin stabilization, PNAS, 109: 13853-1385813. Antonio Gonzalez, et. Al., TTG2 controls the developmental regulation of seed coat tannins in Arabidopsis by regulating vacuolar transport steps in the proanthocyanidian pathway, Developmental Biology, 419:54-6314. Mingxue Chen, et. Al., TRANSPARENT TESTA GLABRA1 regulated the accumulation of seed storage reserves in Arabidopsis, Plant Physiology, 2015, 169:391-40215. Antonio Gonzalez, et. Al., TTG1 complex MYBs, MYB5 and TT2, control outer seed coat differentiation, Developmental Biology, 325:412-42116. TRANSPARENT TESTA2 regulates embryonic fatty acid biosynthesis by targeting FUSCA3 during the early developmental stage of Arabidopsis seeds, The plant J. 2014, 77: 7575-76917. Bert De Rybel, et. Al., a bHLH complex controls embryonic vascular tissue establishment and indeterminate growth in arabidopsis, Cell, 2013, 24:426-43718. Wolfram G. Brenner, et. Al., Gene regulation by cytokinin in Arabidopsis, Frontiers in plant science, 2012, 3:8. doi: 10.3389/fpls.2012.0000819. Yeying Zhang, et. Al., Transcription Factors SOD7/NAGL2 and DPA4/NGAL3 act redundantly to regulated seed size by directly repressing KLU expression in Arabidopsis thaliana, the plant cell, 2015, 27:620-63220. Yanjie Zhang, et. Al., MYB56 encoding a R2R2 MYB transcription factor regulateds seed size in arabidopsis thaliana, J. of Integrative plant biology, 2013, 55: 1166-117821. Gao MJ, et. Al., Repression of Seed Maturation Genes by a Trihelix Transcriptional Repressor in Arabidopsis Seedlings, The Plant Cell, 2009, 21:54-7122. Santos-Mendoza M, et.al., Deciphering gene regulatory networks that control seed development and maturation in Arabidopsis, The Plant Journal, 2008, 54:608-62024. Chhun T, et. Al., HSI2 Repressor Recruits MED13 and HDA6 to Down-Regulate Seed Maturation Gene Expression Directly During Arabidopsis Early Seedling Growth, Plant & Cell Physiology,2006, 57:1689-1706 26. Duan S. et. Al., MYB76 inhibites seed fatty acid accumulation in arabidopsis, Frontiers in Plant Science, 2017,doi:10.3389/fpls.2017.00226 27. Gao MJ, et. al., SCARECROW-LIKE15 interacts with HISTONE DEACETYLASE19 and is essential for repressing the seed maturation programme, Nature Communications, 2015, DOI: 10.1038/ncomms824328. To A., et. Al., A Network of Local and Redundant Gene Regulation Governs Arabidopsis Seed Maturation, The Plant Cell, 2006, 18:1642-165129. Tsuwamoto R. et. Al., Arabidopsis EMBRYOMAKER encoding an AP2 domain transcription factor plays a key role in developmental change from vegetative to embryonic phase, Plant Mol. Biol., 2010, 73:481-492 30. Shi L., et. Al., Arabidopsis glabra2 mutant seeds deficient in mucilage biosynthesis produce more oil, The Plant Journal, 2012, 69:37-4633. Ohto M. et. Al., Effects of APETALA2 on embryo, endosperm, and seed coat development determine seed size in Arabidopsis, Sex Plant Reproduction, 2009, 22:277-28934. Kunieda T. et. Al., NAC Family Proteins NARS1/NAC2 and NARS2/NAM in the Outer Integument Regulate Embryogenesis in Arabidopsis, The Plant Cell, 2008, 20:2632-2642 35. Adamczyk B. J., et. Al., The MADS domain factors AGL15 and AGL18 act redundantly as repressors of the floral transition in Arabidopsis, The Plant Journal, 2007, 50:1007-1019

Soybean Seed TFs† In DMVs

Soybean Seed Non-TF Genes In DMVs

Arabidopsis Seed TFs‡ In DMVs

Arabidopsis Seed Non-TF Genes In DMVs

.CC-BY-NC-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted June 27, 2018. . https://doi.org/10.1101/356501doi: bioRxiv preprint