Genomic features of bacterial adaptation to plantslabs.bio.unc.edu/Dangl/Resources/gfobap_website/s41588... · 2017-12-18 · We also identified 64 plant-associated pro-tein domains

Articleshttps://doi.org/10.1038/s41588-017-0012-9

1DOE Joint Genome Institute, Walnut Creek, CA, USA. 2Department of Biology, University of North Carolina, Chapel Hill, NC, USA. 3Howard Hughes Medical Institute, Chevy Chase, MD, USA. 4Institute of Microbiology, ETH Zurich, Zurich, Switzerland. 5Department of Horticulture, Virginia Tech, Blacksburg, VA, USA. 6International Centre for Genetic Engineering and Biotechnology, Trieste, Italy. 7Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA. 8Department of Microbiology, University of Tennessee, Knoxville, TN, USA. 9Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA. 10School of Environmental and Forest Sciences, University of Washington, Seattle, WA, USA. 11Max Planck Institute for Developmental Biology, Tübingen, Germany. 12School of Natural Sciences, University of California, Merced, Merced, CA, USA. 13The Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC, USA. 14Department of Microbiology and Immunology, University of North Carolina, Chapel Hill, NC, USA. Present address: 15Department of Biology, Stanford University, Stanford, CA, USA; 16The Grassland College, Gansu Agricultural University, Lanzhou, Gansu, China; 17BD Technologies and Innovation, Research Triangle Park, NC, USA. Asaf Levy and Isai Salas Gonzalez contributed equally to this work. *e-mail: [email protected]; [email protected]; [email protected]

The microbiota of plants and animals have coevolved with their hosts for millions of years1–3. Through photosynthesis, plants serve as a rich source of carbon for diverse bacterial com-

munities. These include mutualists and commensals, as well as pathogens. Phytopathogens and growth-promoting bacteria have considerable effects on plant growth, health, and productivity4–7. Except for intensively studied relationships such as root nodula-tion in legumes8, T-DNA transfer by Agrobacterium9, and type III secretion–mediated pathogenesis10, the molecular mechanisms that govern plant–microbe interactions are not well understood. It is therefore important to identify and characterize the bacterial genes and functions that help microbes thrive in the plant environment. Such knowledge should improve the ability to combat plant diseases and harness beneficial bacterial functions for agriculture, with direct effects on global food security, bioenergy, and carbon sequestration.

Cultivation-independent methods based on profiling of marker genes or shotgun metagenome sequencing have considerably improved the overall understanding of microbial ecology in the plant environment11–15. In parallel, reduced sequencing costs have enabled the genome sequencing of plant-associated bacterial iso-lates at a large scale16. Importantly, isolates enable functional valida-tion of in silico predictions. Isolate genomes also provide genomic and evolutionary context for individual genes, as well as the poten-tial to access genomes of rare organisms that might be missed by

metagenomics because of limited sequencing depth. Although metagenome sequencing has the advantage of capturing the DNA of uncultivated organisms, multiple 16S rRNA gene surveys have reproducibly shown that the most common plant-associated bac-teria are derived mainly from four phyla13,17 (Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes) that are amenable to cultivation. Thus, bacterial cultivation is not a major limitation in sampling of the abundant members of the plant microbiome16.

Our objective was to characterize the genes that contribute to bacterial adaptation to plants (plant-associated genes) and those genes that specifically aid in bacterial root colonization (root- associated genes). We sequenced the genomes of 484 new bacte-rial isolates and single bacterial cells from the roots of Brassicaceae, maize, and poplar trees. We combined the newly sequenced genomes with existing genomes to create a dataset of 3,837 high-quality, non-redundant genomes. We then developed a computational approach to identify plant-associated genes and root-associated genes based on comparison of phylogenetically related genomes with knowledge of the origin of isolation. We experimentally validated two sets of plant-associated genes, including a previously unrecognized gene family that functions in plant-associated microbe–microbe compe-tition. In addition, we characterized many plant-associated genes that are shared between bacteria of different phyla, and even between bacteria and plant-associated eukaryotes. This study represents

Genomic features of bacterial adaptation to plantsAsaf Levy 1, Isai Salas Gonzalez2,3, Maximilian Mittelviefhaus4, Scott Clingenpeel 1, Sur Herrera Paredes 2,3,15, Jiamin Miao5,16, Kunru Wang5, Giulia Devescovi6, Kyra Stillman1, Freddy Monteiro2,3, Bryan Rangel Alvarez1, Derek S. Lundberg2,3, Tse-Yuan Lu7, Sarah Lebeis8, Zhao Jin9, Meredith McDonald2,3, Andrew P. Klein2,3, Meghan E. Feltcher2,3,17, Tijana Glavina Rio1, Sarah R. Grant 2, Sharon L. Doty 10, Ruth E. Ley 11, Bingyu Zhao5, Vittorio Venturi6, Dale A. Pelletier7, Julia A. Vorholt4, Susannah G. Tringe 1,12*, Tanja Woyke 1,12* and Jeffery L. Dangl 2,3,13,14*

Plants intimately associate with diverse bacteria. Plant-associated bacteria have ostensibly evolved genes that enable them to adapt to plant environments. However, the identities of such genes are mostly unknown, and their functions are poorly characterized. We sequenced 484 genomes of bacterial isolates from roots of Brassicaceae, poplar, and maize. We then com-pared 3,837 bacterial genomes to identify thousands of plant-associated gene clusters. Genomes of plant-associated bacteria encode more carbohydrate metabolism functions and fewer mobile elements than related non-plant-associated genomes do. We experimentally validated candidates from two sets of plant-associated genes: one involved in plant colonization, and the other serving in microbe–microbe competition between plant-associated bacteria. We also identified 64 plant-associated pro-tein domains that potentially mimic plant domains; some are shared with plant-associated fungi and oomycetes. This work expands the genome-based understanding of plant–microbe interactions and provides potential leads for efficient and sustain-able agriculture through microbiome engineering.

NATuRE GENETICS | www.nature.com/naturegenetics

© 2017 Nature America Inc., part of Springer Nature. All rights reserved.

https://doi.org/10.1038/s41588-017-0012-9

mailto:[email protected]



http://orcid.org/0000-0003-2083-4072

http://orcid.org/0000-0002-6619-6320

http://orcid.org/0000-0001-9246-8337

http://orcid.org/0000-0002-7735-6692

http://orcid.org/0000-0002-9546-315X

http://orcid.org/0000-0002-9087-1672

http://orcid.org/0000-0001-6479-8427

http://orcid.org/0000-0002-9485-5637

http://orcid.org/0000-0003-3199-8654

http://www.nature.com/naturegenetics

Articles Nature GeNetics

a comprehensive and unbiased effort to identify and characterize candidate genes required at the bacteria–plant interface.

ResultsExpanding the plant-associated bacterial reference catalog. To obtain a comprehensive reference set of plant-associated bacterial genomes, we isolated and sequenced 191, 135, and 51 novel bacte-rial strains from the roots of Brassicaceae (91% from Arabidopsis thaliana), poplar trees (Populus trichocarpa and Populus deltoides), and maize, respectively (Methods, Table 1, Supplementary Tables 1–3). The bacteria were specifically isolated from the interior (endo-phytic compartment) or surface (rhizoplane) of plant roots, or from soil attached to the root (rhizosphere). In addition, we isolated and sequenced 107 single bacterial cells from surface-sterilized roots of A. thaliana. All genomes were assembled, annotated, and deposited in public databases and in a dedicated website (“URLs,” Supplementary Table 3, Methods).

A broad, high-quality bacterial genome collection. In addition to the newly sequenced genomes noted above, we collected 5,587 bacterial genomes belonging to the four most abundant phyla of plant-associated bacteria13 from public databases (Methods). We manually classified each genome as plant-associated, non-plant-associated (NPA), or soil-derived on the basis of its unambiguous isolation niche (Methods, Supplementary Tables 1 and 2). The plant-associated genomes included organisms isolated from plants or rhizospheres. A subset of the plant-associated bacteria was also annotated as ‘root-associated’ when isolated from the rhizoplane or the root endophytic compartment. Genomes from bacteria isolated from soil were considered as a separate group, as it is unknown whether these strains can actively associate with plants. Finally, the remaining genomes were labeled as NPA genomes; these were iso-lated from diverse sources, including humans, non-human animals, air, sediments, and aquatic environments.

We carried out stringent quality control to remove low-quality or redundant genomes (Methods). This led to a final dataset of 3,837 high-quality and nonredundant genomes, including 1,160 plant-associated genomes, 523 of which were also root-associated. We grouped these 3,837 genomes into nine monophyletic taxa to allow comparative genomics analysis among phylogenetically related genomes (Fig. 1a, Supplementary Tables 1 and 2, Methods, “URLs”).

To determine whether our genome collection from cultured isolates was representative of plant-associated bacterial communi-ties, we analyzed cultivation-independent 16S rDNA surveys and metagenomes from the plant environments of Arabidopsis11,12,

barley18, wheat, and cucumber14 (Methods). The nine taxa analyzed here account for 33–76% (median, 41%; Supplementary Table 4) of the total bacterial communities found in plant-associated envi-ronments and therefore represent a substantial portion of the plant microbiota, consistent with previous reports13,16,19.

Increased carbohydrate metabolism and fewer mobile elements in plant-associated genomes. We compared the genomes of bac-teria isolated from plant environments with those from bacteria of shared ancestry that were isolated from non-plant environments. We assumed that the two groups should differ in the set of accessory genes that evolved as part of their adaptation to a specific niche. Comparison of the size of plant-associated, soil, and NPA genomes showed that plant-associated and/or soil genomes were signifi-cantly larger than NPA genomes (P < 0.05, PhyloGLM and t-tests; Supplementary Fig. 1a, Supplementary Table 5). We observed this trend in six to seven of the nine analyzed taxa (depending on the test), representing all four phyla. Pangenome analyses of a few genera with plant-associated and NPA isolation sites showed that pangenome sizes were similar between plant-associated and NPA genomes (Supplementary Fig. 2).

Next, we examined whether certain gene categories are enriched or depleted in plant-associated genomes versus in their NPA coun-terparts, using 26 broad functional gene categories (Supplementary Table 6). We used the PhyloGLM test (Fig. 1b) and t-test (Supplementary Fig. 3) to detect enrichment. Two gene categories demonstrated similar phylogeny-independent trends suggestive of an environment-dependent selection process. The “Carbohydrate metabolism and transport” gene category was expanded in the plant-associated organisms of six taxa (Fig. 1b). This was the most expanded category in Alphaproteobacteria, Bacteroidetes, Xanthomonadaceae, and Pseudomonas (Supplementary Fig. 3). In contrast, mobile genetic elements (phages and transposons) were underrepresented in four plant-associated taxa (Fig. 1b and Supplementary Fig. 3). Plant-associated genomes showed increased genome sizes despite a reduction in the number of mobile elements that often serve as vehicles for horizontal gene transfer and genome expansion. A comparison of root-associated bacteria to soil bacteria showed less drastic changes than those seen between plant-associ-ated and NPA groups, as expected for organisms that live in more similar habitats (Fig. 1b and Supplementary Fig. 3).

Identification and validation of plant- and root-associated genes. We sought to identify specific genes enriched in plant- and root-associated genomes compared with NPA and soil-derived

Table 1 | Novel and previously sequenced genomes used in this analysis

Taxon Taxonomic rank Novel sequenced PA genomes

Scanned genomes

Genomes used in analysis

PA NPA Soil RA

Alphaproteobacteria1 Class 126 784 610 368 199 43 169

Burkholderiales1 Order 85 612 433 160 209 64 86

Acinetobacter1 Genus 4 926 454 7 442 5 3

Pseudomonas1 Genus 75 506 349 169 137 43 61

Xanthomonadaceae1 Family 15 264 147 110 26 11 26

Bacillales2 Order 54 664 454 97 185 172 54

Actinobacteria 13 NA 69 504 394 164 142 88 89

Actinobacteria 23 NA 19 845 587 29 526 32 18

Bacteroidetes4 Phylum 37 481 409 56 293 60 17

Total 484 5,586 3,837 1,160 2,159 518 5231Proteobacteria. 2Firmicutes. 3Actinobacteria phylum. PA, plant-associated bacteria; NPA, non-plant-associated bacteria; soil, soil-associated bacteria; RA, root-associated bacteria; NA, not available (an artificial taxon).




ArticlesNature GeNetics

GenusFamilyOrderProteo.Bacteroidetes

Actino. Actinobacteria 2*

Alphaproteobacteria

Proteo. β-proteo.

Proteo. γ-proteo. γ-proteo. AcinetobacterProteo.

Firmic. Bacilli Bacillales

Actino.PseudomonasProteo.

>10.5–1<0.5

>–0.5–1 to –0.5<–1

PhyloGLM estimate

GenusFamilyOrderProteo. AlphaproteobacteriaBacteroidetes

Actino. Actinobacteria 2*Proteo.

Proteo.Proteo.

Firmic. Bacillales

Actino.Proteo.

Non-significant

More genesin PA/RA

More genesin NPA/soil

PA vs. NPA

RA vs. soil

ClassificationNPA (n = 2,159)PA (n = 1,160)RA (n = 523)soil (n = 518)

Acinetobacter (n = 454)

Actinobacteria 1 (n = 394)Actinobacteria 2 (n = 587)

Bacillales (n = 454)

Bacteroidetes (n = 409)

Burkholderiales (n = 433)

Pseudomonas (n = 349)

Alphaproteobacteria (n = 610)

Xanthomonadaceae (n = 147)

Taxonomic groups

a

b

Tree scale: 0.1

0.5 M0.3 M

50 K

0.1 M

0.5 M

1.0 M # genes

# genes

Phylum Class

Phylum Class

2.0 M1.5

M1.0

M0.5 M

Carbohydrates

Signal transd

uction

Coenzymes

Amino acids

Cell wall,

cell m

embrane

Transcrip

tion

General functi

on

Translatio

n

Inorganic ions

Motili

ty

Secondary m

etabolites

Replicatio

n, repair,

recomb.

EnergyLip

ids

Nucleotid

es

Post-tra

nslatio

n

Cell divi

sion

Defense

RNA processi

ng

Chromatin

Unknown

Extrace

llular s

tructu

res

Tra�cking, se

cretio

n

Cytoskeleton

Prophages, transp

osons

1.0 M

0.5 M0.1 M

Carbohydrates

Signal transd

uction

Coenzymes

Amino acids

Cell wall,

cell m

embrane

Transcrip

tion

General functi

on

Translatio

n

Inorganic ions

Motili

ty

Secondary m

etabolites

Replicatio

n, repair,

recomb.

EnergyLip

ids

Nucleotid

es

Post-tra

nslatio

n

Cell divi

sion

Defense

RNA processi

ng

Chromatin

Unknown

Extrace

llular s

tructu

res

Tra�cking, se

cretio

n

Cytoskeleton

Prophages, transp

osons

XanthomonadaceaeXanthomon.

Pseudom.Pseudomon.

Burkholderiales

Actinobacteria 1*Pseudomon.

γ-proteo.

Moraxel.

PseudomonasPseudom.

Xanthomonadaceae

Pseudomon.

Xanthomon.

γ-proteo.

γ-proteo.γ-proteo.

Actinobacteria 1*

Burkholderialesβ-proteo.

Bacilli

AcinetobacterMoraxel.Pseudomon.

Fig. 1 | The genome dataset used in analysis, and differences in gene category abundances. a, The maximum-likelihood phylogenetic tree of 3,837 high-quality and nonredundant bacterial genomes, based on the concatenated alignment of 31 single-copy genes. The outer ring shows the taxonomic group, the central ring shows the isolation source, and the inner ring shows the root-associated (RA) genomes within plant-associated (PA) genomes. Taxon names are color-coded according to phylum: green, Proteobacteria; red, Firmicutes; blue, Bacteroidetes; purple, Actinobacteria. See “URLs” for the iTOL interactive phylogenetic tree. b, Differences in gene categories between plant-associated and NPA genomes (top) and between root-associated and soil-associated genomes (bottom) of the same taxon. Both heat maps indicate the level of enrichment or depletion based on a PhyloGLM test. Significant cells (color-coded according to the key) represent P values of < 0.05 (FDR-corrected). Pink-red cells indicate significantly more genes in plant-associated and root-associated genomes in the top and bottom heat maps, respectively. Histograms at the top and right of each heat map represent the total number of genes compared in each column and row, respectively. Asterisks indicate non-formal class names. “Carbohydrates” denotes the carbohydrate metabolism and transport gene category. Full COG category names for the x-axis labels are presented in Supplementary Table 6. Note that cells representing high absolute estimate values (dark colors) are based on categories of few genes and are therefore more likely to be less accurate. Phylum names are color-coded as in a. Xanthomon., Xanthomonadales; Pseudomon., Pseudomonadales; Pseudom., Pseudomonadaceae; Moraxel., Moraxellaceae.





genomes, respectively (Supplementary Fig. 4, Methods). First, we clustered the proteins and/or protein domains of each taxon on the basis of homology, using the annotation resources COG20, KEGG Orthology21, and TIGRFAM22, which typically comprise 35–75% of all genes in bacterial genomes23. To capture genes that do not have existing functional annotations, we also used OrthoFinder24 (after benchmarking; Supplementary Fig. 5) to cluster all protein sequences within each taxon into homology-based orthogroups. Finally, we clustered protein domains with Pfam25 (Methods, “URLs”). We used these five protein/domain-clustering approaches in parallel comparative genomics pipelines. Each protein/domain sequence was additionally labeled as originating from either a plant-associated genome or an NPA genome.

Next, we determined whether protein/domain clusters were sig-nificantly associated with a plant-associated lifestyle by using five independent statistical approaches: hypergbin, hypergcn (two ver-sions of the hypergeometric test), phyloglmbin, phyloglmcn (two phylogenetic tests based on PhyloGLM26), and Scoary27 (a stringent combined test) (Methods). These analyses were based on either gene presence/absence or gene copy number. We defined a gene as sig-nificantly plant-associated if at least one test showed that it belonged to a significant plant-associated gene cluster, and if it originated from a plant-associated genome. We defined significant NPA, root-associated, and soil genes in the same way. Significant gene clusters identified by the different methods had varying degrees of overlap (Supplementary Figs. 6 and 7). In general, we noted a high degree of overlap between plant-associated and root-associated genes and overlap between NPA and soil-associated genes (Supplementary Fig. 8). Overall, plant-associated genes were depleted from NPA genomes from heterogeneous isolation sources (Supplementary Figs. 9 and 10). Principal coordinates analysis with matrices that contained only the plant-associated and NPA genes derived from each method as features increased the separation of plant-associated from NPA genomes along the first two axes (Supplementary Fig. 11). We provide full lists of statistically significant plant-associated, root-associated, soil-associated, and NPA proteins and domains accord-ing to the five clustering techniques and five statistical approaches for each taxon in Supplementary Tables 7–15 (also see “URLs”).

To validate our predictions, we assessed the abundance patterns of plant-associated and root-associated genes in natural environ-ments. We retrieved 38 publicly available plant-associated, NPA, root-associated, and soil-associated shotgun metagenomes, includ-ing some from plant-associated environments that were not used for isolation of the bacteria analyzed here14,28,29 (Supplementary Table 16a). We mapped reads from these culture-independent metagenomes to plant-associated genes found with all statisti-cal approaches (Methods, Supplementary Figs. 12–16). Plant-associated genes in up to seven taxa were more abundant (P < 0.05, t-test) in plant-associated metagenomes than in NPA metagenomes (Fig. 2a, Supplementary Table 16b). Root-associated, soil-associ-ated, and NPA genes, in contrast, were not necessarily more abun-dant in their expected environments (Supplementary Table 16b).

In addition, we selected eight genes that were predicted to be plant-associated by multiple approaches (Supplementary Table 17a) for experimental validation via an in planta bacterial fitness assay (Methods). We inoculated the roots of surface-sterilized rice seedlings (n = 9–30 seedlings per experiment) with wild-type Paraburkholderia kururiensis M130 (a rice endophyte30) or a knock-out mutant strain for each of the eight genes. We grew the plants for 11 d and then collected and quantified the bacteria that were tightly attached to the roots (Methods, Supplementary Table 17b). Mutations in two genes led to fourfold to sixfold reductions in colo-nization (false discovery rate (FDR)-corrected Wilcoxon rank sum test, q < 0.1) relative to that by wild-type bacteria (Fig. 2b), without an observed effect on growth rate (Supplementary Fig. 17). These two genes encode an outer-membrane efflux transporter from the

nodT family and a Tir chaperone protein (CesT), respectively. It is plausible that the other six genes assayed function in facets of plant association not captured in this experimental context.

Functions for which coexpression of and cooperation between different proteins are needed are often encoded by gene operons in bacteria. We therefore tested whether our methods could cor-rectly predict known plant-associated operons. We grouped plant-associated and root-associated genes into putative plant-associated and root-associated operons on the basis of their genomic proxim-ity and orientation (Supplementary Fig. 4, Methods, “URLs”). This analysis yielded some well-known plant-associated functions, such as those of the nodABCSUIJZ and nifHDKENXQ operons (Fig. 2c,d). Nod and Nif proteins are integral for biological nitrogen cycling and mediate root nodulation31 and nitrogen fixation32, respectively. We also identified the biosynthetic gene cluster for the precursor of the plant hormone gibberellin33,34 (Fig. 2e). Other known plant-associ-ated operons identified are related to chemotaxis35, secretion systems such as T3SS36 and T6SS37, and flagellum biosyntheis38–40 (Fig. 2f–i).

Thus, we identified thousands of plant-associated and root-associated gene clusters by using five different statistical approaches (Supplementary Table 18) and validated them by means of compu-tational and experimental approaches, broadening our understand-ing of the genetic basis of plant–microbe interactions and providing a valuable resource to drive further experimentation.

Protein domains reproducibly enriched in diverse plant-asso-ciated genomes. Plant-associated and root-associated proteins and protein domains conserved across evolutionarily diverse taxa are potentially pivotal to the interaction between bacteria and plants. We identified 767 Pfam domains as significant plant-asso-ciated domains in at least three taxa, on the basis of multiple tests (Supplementary Table 19a). Below we elaborate on a few domains that were plant-associated or root-associated in all four phyla. Two of these domains, a DNA-binding domain (pfam00356) and a ligand-binding (pfam13377) domain, are characteristic of the LacI transcription factor (TF) family. These TFs regulate gene expres-sion in response to different sugars41, and their copy numbers were expanded in the genomes of plant-associated and root-associated bacteria in eight of the nine taxa analyzed (Fig. 3a). Examination of the genomic neighbors of lacI-family genes identified strong enrichment for genes involved in carbohydrate metabolism and transport in all of these taxa, consistent with their expected regula-tion by a LacI-family protein41 (Supplementary Fig. 18). We ana-lyzed the promoter regions of these putative regulatory targets of LacI-family TFs, and identified three AANCGNTT palindromic octamers that were statistically enriched in all but one taxon, and which may serve as the TF-binding site (Supplementary Table 20). These data suggest that accumulation of a large repertoire of LacI-family-controlled regulons is a common strategy across bacterial lineages during adaptation to the plant environment.

Another domain, the metabolic domain aldo-keto reductase (pfam00248), was enriched in the genomes of plant-associated and root-associated bacteria from eight taxa belonging to all four phyla investigated (Fig. 3b). This domain is involved in the metabolic con-version of a broad range of substrates, including sugars and toxic carbonyl compounds42. Thus, bacteria that inhabit plant environ-ments may consume similar substrates. Additional plant-associated and root-associated proteins and domains that were enriched in at least six taxa are described in Supplementary Fig. 19.

We also identified domains that were reproducibly enriched in NPA and/or soil-associated genomes, including many domains of mobile genetic elements (Supplementary Fig. 20).

Putative plant protein mimicry by plant- and root-associated proteins. Convergent evolution and horizontal transfer of protein domains from eukaryotes to bacteria have been suggested for some





Bacillales Pseudomonas

Xanthomonadaceae

Alphaproteobacteria

Bacteroidetes

Actinobacteria 1

Alphaproteobacteria, Bradyrhizobium sp. ARR65

Burkholderiales, Burkholderia kururiensis M130G118DRAFT_01458

NodZ

g

A PA gene according to hypergbin

A PA gene according to phyloglmbin

A PA gene according to Scoary

BraARR65DRAFT_00043940

Alphaproteobacteria, Rhizobium etli bv. mimosae IE4771

mlr6364

mlr6365

mlr6366

mlr6367

mlr6368

mlr6368

mlr6370

Alphaproteobacteria, Rhizobium leguminosarum bv. viciae 3841

mcpB

cheY

cheW

cheA

Xanthomonadaceae, Xanthomonas axonopodis pv. vasculorum NCPPB 900

IE4771_PB00363 (gene id2585987913)

RL4028

h Burkholderiales, Variovorax sp. Root411Ga0102102_11238

ImpH

ImpF

ImpG

ImpE

Hcp

ImpM

ImpL

ImpK

ImpJ

Alphaproteobacteria, Ensifer medicae WSM1115 iSinmedDRAFT_3917

d

nifH

nifD

nifK

nifE

nifN

nifX

c

e

f

NolO

nifQ

cheR

cheB

cheW

T3SS protein U

T3SS protein V

HpaB

YscJ/HrcJ fam

ily

HpaA

HpaP

T3SS protein QT3SS protein RT3SS protein S

T3SS protein D

HrpB2

HrpB1

HrpE

HrpB4

T3SS protein L

T3SS protein N

HrpB7

T3SS protein T

T3SS protein C

VasG

VgrG

FlgF

FlgBFlgC

FlgGFlgAFlgIM

otEFlgHFliLFliP

Flagellin

Flagellin

Flagellin

Flagellin

FlgE

FlgK

FlgL

flaFflbTflgD

fliI

flhA

fliR

Hum

an g

ut (n

= 3

)O

cean

(n =

1)A

rctic

sal

t lak

e (n

= 3

)A

t soi

l (n

= 2)

At r

hizo

. (n

= 4)

Whe

at rh

izop

lane

(n =

3)

Whe

at s

oil (

n =

2)

Cuc

umbe

r soi

l (n

= 2)

Cuc

um. r

hizo

plan

e (n

= 3

)Po

plar

soi

l (n

= 5)

Popl

ar rh

izos

pher

e (n

= 5

)Po

plar

end

osph

ere

(n =

5)

NPA

, all

(n =

7)

Soil,

all

(n =

11)

PA, a

ll (n

= 2

0)

RA, a

ll (n

= 11

)

log 10

(CFU

/g o

f roo

t)

5.05.56.06.57.07.58.0

Nor

mal

ized

# re

ads

a

b

Nodulation

Nitrogenfixation

Ent-kaurene biosynthesis(gibberelin precursor)

Chemotaxis

Type III secretion systemGa0097773_106954

Type VI secretion system

Flagellum biosynthesis

–1

3

4

5

1

2

3

1

2

3

4

0

1

2

3

4

–4

–2

0

2

0

3.5

4.5

5.5

Mean valueMedian value

P = 0.03

P = 0.001

P = 0.006

P = 0.046

P = 0.004

P = 0.0006

P = 0.045

NodJ

NodI

NodU

NodS

NodC

NodB

NodA

Metagenome samples: NPA Soil PA RA (also PA)

Hum

an g

ut (n

= 3

)O

cean

(n =

1)A

rctic

sal

t lak

e (n

= 3

)A

t soi

l (n

= 2)

At r

hizo

. (n

= 4)

Whe

at s

oil (

n =

2)W

heat

rhiz

opla

ne (n

= 3

)C

ucum

ber s

oil (

n =

3)C

ucum

. rhi

zopl

ane

(n =

3)

Popl

ar s

oil (

n =

5)Po

plar

rhiz

osph

ere

(n =

5)

Popl

ar e

ndos

pher

e (n

= 5

)N

PA, a

ll (n

= 7

)PA

, all

(n =

20)

Soil,

all

(n =

11)

RA, a

ll (n

= 11

)

Wild

type

G118DRAFT

_05604

G118DRAFT

_03668

Fig. 2 | Validation of predicted plant-associated genes by multiple approaches. a, Plant-associated (PA) genes, which were predicted from isolate genomes, were more abundant in PA metagenomes than in NPA metagenomes. Reads from 38 shotgun metagenome samples were mapped to significant PA, NPA, RA, and soil-associated genes predicted by Scoary. P values are indicated for the significant differences between PA and NPA genes or RA and soil-associated genes in each taxon (two-sided t-test). Full results and an explanation for normalization are presented in Supplementary Fig. 14. b, Results of a rice root colonization experiment using wild-type Paraburkholderia kururiensis M130 or knockout mutants for two predicted plant-associated genes. Two mutants showed reduced colonization compared with the wild type: G118DRAFT_05604 (q-value = 0.00013, Wilcoxon rank sum test), which encodes an outer membrane efflux transporter from the nodT family, and G118DRAFT_03668 (q-value = 0.0952, Wilcoxon rank sum test), a Tir chaperone protein (CesT). Each point represents the average count of a minimum of three to six plates derived from the same plantlet, expressed as colony-forming units (CFU) per gram of root. c–i, Examples of known functional plant-associated operons captured by different statistical approaches. The plant-associated genes are highlighted by shaded bars, colored according to the key. c, Nod genes. d, NIF genes. e, Ent-kaurene (gibberelin precursor). f, Chemotaxis proteins in bacteria from different taxa. g, Type III secretion system. h, Type VI secretion system, including the imp genes (impaired in nodulation). i, Flagellum biosynthesis in Alphaproteobacteria. Labels show the gene symbol or the protein name for which such information was available.





microbial effector proteins that are secreted into eukaryotic host cells to suppress defense and facilitate microbial proliferation43–45. We searched for new candidate effectors or other functional plant-protein mimics. We retrieved a set of significant plant-associated and root-associated Pfam domains that were reproducibly predicted by multiple approaches or in multiple taxa, and we cross-referenced these with protein domains that were also more abundant in plant genomes than in bacterial genomes (Methods). This analysis yielded 64 plant-resembling plant-associated and root-associated domains (PREPARADOs) encoded by 11,916 genes (Supplementary Fig. 21, Supplementary Table 21). The number of PREPARADOs was four-fold higher than the number of domains that overlapped repro-ducible NPA/soil-associated domains and plant domains (n = 15). The PREPARADOs were relatively abundant in genomes of plant-associated Bacteroidetes and Xanthomonadaceae ( > 0.5% of all domains on average; Supplementary Fig. 22). Some PREPARADOs were previously described as domains within effector proteins, such as Ankyrin repeats46, regulator of chromosome condensation

repeat (RCC1)47, leucine-rich repeat (LRR)48, and pectate lyase49. PREPARADOs from plant genomes were enriched 3–14-fold (P < 10−5, Fisher’s exact test) as domains predicted to be ‘integrated effector decoys’ when fused to plant intracellular innate immune receptors of the NLR class50–53 (compared with two random domain sets; Methods, Supplementary Figs. 21 and 23, Supplementary Table 21). We found that 2,201 bacterial proteins that encode 17 of the 64 PREPARADOs shared ≥ 40% identity across the entire pro-tein sequence with eukaryotic proteins from plants, plant-associ-ated fungi, or plant-associated oomycetes, and therefore are likely to maintain a similar function (Supplementary Fig. 24, Supplementary Tables 21 and 22). The varied phylogenetic distribution among this protein class could have resulted from convergent evolution or from cross-kingdom horizontal gene transfer between phylogenetically distant organisms subjected to the shared selective forces of the plant environment.

Seven PREPARADO-containing protein families were character-ized by N-terminal eukaryotic or bacterial signal peptides followed

0.000

0.002

0.004

0.006

0.008

0.010

0.000

0.001

0.002

0.003

0.004

0.005

Transcriptional regulator, LacI family

350

Periplasmic binding protein-like,ligand binding (pfam13377)

LacI, DNA-binding domain(pfam00356)

Actino. 1

Alphaprot.

Burkho. Pseud. Xanthom.

a

1 320

Aldo-keto reductase (pfam00248)b

Alphaprot. Bacillales

Bactero. Burkho. Pseud. Xanthom.

Frac

tion

of a

ll do

mai

ns

0.000

0.002

0.004

0.006

0.008

0.010

0.012

0.000

0.001

0.002

0.003

0.004

Bactero.

Frac

tion

of a

ll do

mai

ns

NPA

PA

RA

Soil

BacillalesActino. 1

Actino. 2

Actino. 2

1

Cla

ssifi

ed a

s si

gnifi

cant

Dis

trib

utio

n in

gen

omes

P = 3 × 10–12

P = 7 × 10–14

P = 1 × 10–8

P = 3 × 10–7

P = 1 × 10–8P = 6 × 10–7

P = 0.01

P = 0.003

P = 0.002

P = 0.001P = 0.04

P = 3 × 10–5

P = 2 × 10–27

P = 7 × 10–7

P = 3 × 10–49

P = 3 × 10–5

P = 0.004P = 0.03

P = 4 × 10–6

P = 3 × 10–14

P = 7 × 10–27

P = 0.03P = 0.002

P = 0.008

P = 0.001

Fig. 3 | Proteins and protein domains that were reproducibly enriched as plant-associated or root-associated in multiple taxa. We compared the occurrence of protein domains (from Pfam) between plant-associated (PA) and NPA bacteria and between root-associated (RA) and soil-associated bacteria. Color-coding is as in Fig. 1a. a, Transcription factors with LacI (Pfam00356) and periplasmic-binding protein domains (Pfam13377). These proteins are often annotated as COG1609. b, Aldo-keto reductase domain (Pfam00248). Proteins with this domain are often annotated as COG0667. We used a two-sided t-test to test for the presence of the genes in a and b in genomes that shared the same label and to verify the enrichment reported by the various tests. FDR-corrected P values are shown for significant results (q-value < 0.05). Colored circles indicate the number of different statistical tests ( ≤ 5) supporting plant, non-plant, root, or soil association of a gene or domain, with each circle representing one test. Gene illustrations above each graph represent random protein models. Note that a and b each contain two graphs because of the different scales. Actino., Actinobacteria; Alphaprot., Alphaproteobacteria; Burkho., Burkholderiales; Bactero., Bacteroidetes; Pseud., Pseudomonas; Xanthom., Xanthomonadaceae. Box-and-whisker plots show the median (center lines), 25th and 75th percentiles (box edges), extreme data points within 1.5 times the interquartile range from the box edge (whiskers), and outliers (isolated data points). Full results are in Supplementary Table 19.





by a PREPARADO dedicated to carbohydrate binding or metabo-lism (Supplementary Table 21). One of these domains, Jacalin, is a mannose-binding lectin domain that is found in 48 genes in the A. thaliana genome, compared with three genes in the human genome25. Mannose is found on the cell wall of different bacterial and fungal pathogens and could serve as a microbial-associated molecular pattern that is recognized by the plant immune system54–61. We identified a family of ~430-amino-acid-long microbial proteins with a signal peptide followed by a functionally ill-defined endonu-clease/exonuclease/phosphatase family domain (pfam03372), and ending with a Jacalin domain (pfam01419). This domain architec-ture is absent in plants but is found in diverse microorganisms, many of which are phytopathogens, including Gram-negative and Gram-positive bacteria, fungi from the Ascomycota and Basidiomycota phyla, and oomycetes (Fig. 4). We speculate that these microbial lectins may be secreted to outcompete plant immune receptors for mannose-binding on the microbial cell wall, effectively serving as camouflage.

We thus discovered a large set of protein domains that are shared between plants and the microbes that colonize them. In many cases the entire protein is conserved across evolutionarily distant plant-associated microorganisms.

Co-occurrence of plant-associated gene clusters. We identified numerous cases of plant-associated gene clusters (orthogroups) that demonstrate high co-occurrence between genomes (“URLs”). When the plant-associated genes were derived by phylogeny-aware tests (i.e., PhyloGLM and Scoary), they were candidates for inter-taxon horizontal gene transfer events. For example, we identified a cluster predicted by Scoary of up to 11 co-occurring genes (mean pairwise Spearman correlation: 0.81) in a flagellum-like locus from sporadically distributed plant-associated or soil-associated genomes across 12 different genera in Burkholderiales (Fig. 5). Two of the annotated flagellar-like proteins, FlgB (COG1815) and FliN (pfam01052), are also encoded by plant-associated genes in Actinobacteria 1 and Alphaproteobacteria taxa. Six of the remaining genes encode hypothetical proteins, all but one of which are specific to Betaproteobacteria, suggestive of a flagellar structure variant that evolved in this class in the plant environment. Flagellum-mediated motility or flagellum-derived secretion systems (for example, T3SS) are important for plant colonization and virulence39,40,62,63 and can be horizontally transferred64.

Novel putative plant- and root-associated gene operons. In addi-tion to successfully capturing several known plant-associated oper-ons (Fig. 2c–i), we also identified putative plant-associated bacterial operons (“URLs”). Two previously uncharacterized plant-associated gene families were conspicuous. These genes are organized in mul-tiple loci in plant-associated genomes, each with up to five tandem gene copies. They encode short, highly divergent, high-copy-number proteins that are predicted to be secreted, as explained below. These two plant-associated protein families never co-occurred in the same genome, and their genomic presence was perfectly correlated with lifestyles of pathogenic or nonpathogenic bacteria of the genus Acidovorax (order Burkholderiales) (Fig. 6a). We named the gene families present in non-pathogens and pathogens Jekyll and Hyde, respectively, after the characters in Robert Louis Stevenson’s classic novel.

The typical Jekyll gene is 97 amino acids long, contains an N-terminal signal peptide, lacks a transmembrane domain, and, in 98.5% of cases, appears in non-pathogenic plant-associated or soil-associated Acidovorax isolates (Fig. 6a, Supplementary Fig. 25d, Supplementary Table 23a). A single genome may encode up to 13 Jekyll gene copies (Fig. 6a) distributed in up to nine loci (Supplementary Table 23a). We recently isolated four Acidovorax strains from the leaves of naturally grown Arabidopsis16. Even these nearly identical isolates carried hypervariable Jekyll loci that were substantially more divergent than neighboring genes and included copy-number variations and various mutations (Fig. 6b, Supplementary Fig. 25, Supplementary Table 24).

The Hyde putative operons, in contrast, are composed of two distinct gene families unrelated to Jekyll. A typical Hyde1 protein has 135 amino acids and an N-terminal transmembrane helix. Hyde1 proteins are also highly variable, as demonstrated by copy-number variation, sequence divergence, and intralocus transposon insertions (Fig. 6a,c, Supplementary Fig. 26a–c, Supplementary Table 23b). Hyde1 was found in 99% of cases in phytopathogenic Acidovorax. These genomes carried up to 15 Hyde1 gene copies distributed in up to ten loci (Fig. 6a, Supplementary Table 23b). In 70% of cases Hyde1 was located directly downstream from a more conserved ~300-amino-acid-long plant-associated protein-coding gene that we named Hyde2 (Fig. 6c,d, Supplementary Table 23d). We identified loci with Hyde2 followed by Hyde1-like genes in dif-ferent members of the Proteobacteria phylum. These contained a highly variable Hyde1-like protein family that maintained only the

Paenibacillus sp. PAMC 26794, Firmicutes

0.1

10021

36

23

30

91

89

84

46

100

100Oryza sativa japonica group

JacalinDirigent

Oryza sativa japonica group

Protein kinase, catalyticdomain

Arabidopsis thaliana

F-box associated 1

Arabidopsis thaliana

Endonuclease/exonuclease/phosphatase family Signal peptide, Gram– bacteria

Magnaporthe oryzae 70-15, Ascomycota

Aspergillus oryzae 3.042, Ascomycota

Fomitopsis pinicola FP-58527 SS1, Basidiomycota

Rhizoctonia solani AG-3 Rhs1AP, Basidiomycota

Signal peptide, Gram+ bacteriaSignal peptide, eukaryotes

Phytophthora parasitica, oomycetes

Leptonema illini DSM 21528, Spirochaetes

Streptomyces sp. 303MFCol5.2, Actinobacteria

Xanthomonas axonopodispv. citri str. 306, Proteobacteria

Bacteria: PA, NPA, soilPA fungiPA oomycetesPlants

XP_015619049

BAF14474

NP_191518.1

NP_176220.1

WP_017689535.1

XP_003714330.1

EIT77169.1

EPS95407.1

EUC62117.1

ETM02225.1

EHQ06050.1

WP_020125377.1

AAM38968.1

Fig. 4 | A protein family shared by plant-associated bacteria, fungi, and oomycetes that resemble plant proteins. A maximum-likelihood phylogenetic tree of representative proteins with Jacalin-like domains across plants and plant-associated (PA) organisms. Endonuclease/exonuclease/phosphatase-Jacalin proteins are present across PA eukaryotes (fungi and oomycetes) and PA bacteria. In most cases these proteins contain a signal peptide in the N terminus. The Jacalin-like domain is found in many plant proteins, often in multiple copies. The protein accession is shown above each protein illustration.





Burkholderiales_OG0008862 (hyp.)

Burkholderiales_OG0008861 (FliN)Burkholderiales_OG0008863 (FlgA)

Burkholderiales_OG0008350 (hyp.)Burkholderiales_OG0009542 (hyp.)

Burkholderiales_OG0008351 (FlgB) Burkholderiales_OG0007689 (hyp.)

Burkholderiales_OG0007460 (hyp.)Burkholderiales_OG0010507 (hyp.)

Burkholderiales_OG0010506 (FliO)

Burkholderiales_OG0004311 (RHS)

ClassificationNPA

PA

RA

Soil

Tree scale: 0.1

Lim

noha

bita

ns_s

p._R

im28

Xyloph

ilus_

sp._L

eaf2

20

Bordetella_bronchiseptica_E014

Acidovorax_konjaci_DSM_7481

Bordetella_bronchiseptica_10580

Ralstonia_sp._AU12-08_(IVD)

Polar

omon

as_s

p._O

V174

Otto

wia

_thi

ooxy

dans

_DSM

_146

19

Alicycliphilus_denitrificans_BC

Cupriavidus_sp._AMP6

Comamonas_testosteroni_S44

Pelo

mon

as_s

p._R

oot1

444

Burkholderia_mallei_2002721280

Ralstonia_sp._25MFCol4.1

Rhiz

obac

ter_

sp._

OV

335

Janthinobacterium_sp._344

Ralstonia_solanacearum_23-10BR

Burkholderia_oklahomensis_BDU

Massilia_sp._LC238

Bordetella_bronchiseptica_D756

Rhiz

obac

ter_

sp._

Root

16D

2

Burkholderia_cepacia_383

Met

hylib

ium

_sp.

_Roo

t127

2

Burkholderia_mallei_BMQ

Variovorax_sp._770b2

Burkholderia_cenocepacia_MC0-3Burkholderia_thailandensis_USAMRU_Malaysia__20

Hyl

emon

ella

_gra

cilis

_Nia

gara

_R

Burk

hold

eria

_oxy

phila

_NBR

C_10

5797

Bordetella_sp._FB-8,_DSM_24873

Bordetella_avium_197N

Burkholderia_pseudomallei_NCTC_13178

Acidovorax_avenae_avenae_RS-1

Burk

hold

eria

_spr

entia

e_W

SM50

05

Alcaligenes_faecalis_faecalis_NCIB_8687

Burkholderia_pseudomallei_1106a

Oligella_urethralis_DNF00040

Cupriavidus_taiwanensis_STM6070

Burkholderia_ambifa

ria_MC40-6

Burk

hold

eria

_sp.

_JPY

251

Burkholderia_mallei_BMY

Ralstonia_solanacearum_M

olK2

Ralstonia_solanacearum_Rs-09-161

Taylorella_equigenitalis_MCE9

Acidovorax_sp._JHL-3

Burkholderia_multiv

orans_ATCC_17616

Burkholderi

a_sp

._lig3

0

Burkholderiales_bacterium_1_1_47

Duganella_sp._Leaf61

Burk

hold

eria

_kur

urie

nsis

_thi

ooxy

dans

_NBR

C_1

0710

7

Ralstonia_pickettii_27511

Burkholderia_pseudomallei_NAU20B-16

Verminephrobacter_eiseniae_EF01-2

Burk

hold

eria

_fun

goru

m_N

BRC_

1024

89

Sutterella_wadsworthensis_2_1_59BFAA

Acidovorax_sp._NO-1

Polar

omon

as_s

p._JS

666

Variovorax_sp

._Root434

Comamonas_granuli_NBRC_101663

Cupriavidus_taiwanensis_STM

6018

Bordetella_bronchiseptica_A1-7

Burkholderia_cenocepacia_869T2

Acidovorax_ebreus_TPSY

Comamonas_testosteroni_sv._Ba_CNB-1

Herbaspirillum_sp._YR522

Burkholderi

a_gla

dioli_BSR

3

Burkholderia_pseudomallei_MSHR5848


Achromobacter_piechaudii_HLE

Parasutterella_excrementihominis_YIT_11859

Bordetella_holmesii_35009

Met

hylib

ium

_pet

role

iphi

lum

_PM

1

Acidovorax_delafieldii_DSM_64

Ralstonia_pickettii_OR214

Rhiz

obac

ter_

sp._

Root

29

Albi

dife

rax_

sp._O

V413

Variovorax_sp._Root318D1

Variovorax_paradoxus_B4

Achromobacter_xylosoxidans_NH44784-1996

Delftia_acidovorans_SPH-1

Burkholderia

_dolosa

_PC543_(L

MG_19

468)

Cupriavidus_sp._UYPR2.512

Chitinimonas_koreensis_DSM_17726

Bordetella_bronchiseptica_KM22

Burkholderia_multiv

orans_CGD2M

Ralstonia_pickettii_12DBurk

hold

eria_

bryo

phila

_376

MFS

ha3.1

Burkholderia_pseudomallei_668

Acidovorax_avenae_avenae_ATCC_19860

Taylorella_equigenitalis_14/56

Burk

hold

eria

_nod

osa_

DSM

_216

04

Acidovorax_caeni_DSM_19327

Acidovorax_oryzae_ATCC_19882

Massilia_alkalitolerans_DSM_17462

Burk

hold

eria

_ter

rae_

NBR

C_10

0964

Ralstonia_sp._UNC404CL21Col

Comamonas_testosteroni_ATCC_11996

Burkholderia_mallei_GB8_horse_4

Burk

hold

eria

_sp.

_MP-

1

Variovo

rax_sp

._OK605

Herbaspirillum_seropedicae_SmR1

Alcaligenes_sp._DH1f

Delftia_acidovorans_CCUG_15835

Hyl

emon

ella

_gra

cilis

_Can

ale-

Paro

la_D

4,_A

TCC_

1962

4

Pandoraea_sp._RB-44

Burk

hold

eria

_kur

urie

nsis

_M13

0

Pseu

dorh

odof

erax

_sp.

_Lea

f267

Ralstonia_solanacearum_Rs-10-244

Rhiz

obac

ter_

sp._

Root

1221

Variovorax_paradoxus_S110

Burkholderia_cenocepacia_H111

Burkholderia_cepacia_DDS_7H-2

Met

hylib

ium

_sp.

_CF4

68

Burkholderia_mallei_NCTC_10247

Massilia_timonae_CCUG_45783

Taylorella_asinigenitalis_MCE3

Collimonas_sp._OK242

Cupriavidus_sp._BIS7

Variovo

rax_s

p._350M

FTsu

5.1

Rhiz

obac

ter_

sp._

Root

404

Bordetella_trematum

_CCUG_13902

Herbaspirillum_seropedicae_Os45

Bordetella_bronchiseptica_1289

Com

amon

adac

eae_

bact

eriu

m_U

RHA0

028

Bordetella_pertussis_CHLA

-11

Ralstonia_eutropha_JMP134

Duganella_zoogloeoides_ATCC_25935

Burk

hold

eria

_sp.

_CCG

E100

2

Massilia_sp._Root418


Bordetella_pertussis_Tohama_I

Ralstonia_solanacearum_FQ

Y_4

Pandoraea_pnomenusa_RB38

Pelo

mon

as_s

p._R

oot6

62

Sutterella_wadsworthensis_3_1_45B

Ralstonia_solanacearum_Po82

Burk

hold

eria

_car

iben

sis_M

BA4

Sutterella_wadsworthensis_HGA0223

Janthinobacterium_sp._Marseille



Janthinobacterium_sp._CG3

Burkholderia_cenocepacia_K56-2Valvano

Burkholderia

_multiv

orans_CGD1

Noviherbaspirillum_sp._Root189

Basilea_psittacipulmonis_DSM_24701

Burk

holde

ria_s

p._W

SM22

30


Burkholderia_vietnamiensis_G4

Cupriavidus_taiwanensis_LM

G_19424



Burk

hold

eria

_sp.

_BT0

3

Pelo

mon

as_s

p._R

oot1

237

Comamonas_aquatica_NBRC_14918

Ralstonia_solanacearum_CM

R15

Janthinobacterium_sp._RA13

Burkholderia_pseudomallei_BPC006

Burk

hold

eria

_ban

nens

is_N

BRC_

1038

71

Rubr

iviv

ax_b

enzo

atily

ticus

_ATC

C_B

AA

-35



Burkholderia_sp._MSh1

Ralstonia_solanacearum_CFBP2957

Cupriavidus_sp._YR651

Achromobacter_xylosoxidans_AXX-A

Derxia_gummosa_DSM_723

Burkholderia_pseudomallei_1026b

Burk

holde

ria_p

heno

lirup

trix_

BR34

59

Massilia_sp._Leaf139

Rhod

ofer

ax_s

aiden

bach

ensis

_ED16

Variovorax_paradoxus_4MFCol3.1

Bordetella_pertussis_CS

Delftia_sp._Cs1-4

Janthinobacterium_sp._551a

Acidovorax_sp._Leaf160

Variovorax_paradoxus_349MFTsu5.1


Cupriavidus_necator_N-1,_ATCC_43291


Burkh

older

ia_glu

mae_B

GR1


Thiomonas_sp._FB-6,_D

SM_25805

Ralstonia_solanacearum_N

CPPB_282

Acidovorax_sp._KKS102

Taylorella_asinigenitalis_ATCC_700933

Ram

libac

ter_

sp._L

eaf4

00

Burk

hold

eria

_sp.

_UYP

R1.4

13

Comamonas_badia_DSM_17552

Ralstonia_solanacearum_G

MI1000

Herbaspirillum_massiliense_JC206

Polar

omon

as_s

p._YR5

68

Burkholderia_thailandensis_MSMB121

Burkholderia_pseudomallei_HBPUB10134a

Burkholderia_cenocepacia_gv._III_J2315

Janthinobacterium_agaricidamnosum_W1R3





Pandoraea_sp._E26

Comamonas_testosteroni_TK102

Advenella_mimigardefordensis_DPN7,_DSM_17166_(MIMI)

Burkholderi

a_ce

nocepac

ia_DDS_2

2E-1

Lim

noha

bita

ns_s

p._R

im47

Burk

hold

eria

_sp.

_RPE

64



Burkholderia_andropogonis_Ba3549

Oligella_urethralis_DSM_7531

Acidovorax_valerianellae_DSM_16619


Ralstonia_solanacearum_N

CPPB909

Burk

hold

eria

_sp.

_WSM

4176

Achromobacter_sp._Root170


Burkholderia_thailandensis_H0587

Burkholderia_cepacia_GG4

Pelo

mon

as_s

p._R

oot1

217

Variovo

rax_sp

._CF313

Burk

hold

eria

_sp.

_JC

2949

Burkholderia_cepacia_Bu72

Acidovorax_radicis_N35

Pseudoduganella_violaceinigra_DSM_15887

Duganella_sp._Leaf126

Pseu

dorh

odof

erax

_sp.

_Lea

f265

Burkholderia_sp._WP42

Burk

hold

eria

_aci

dipa

ludi

s_N

BRC_

1018

16

Burkholderia_mallei_JHU

Burkholderia_thailandensis_E444



Burkholderia_cepacia_AMMD

Pusillimonas_noertemannii_BS8

Polar

omon

as_n

apht

halen

ivora

ns_C

J2

Polynucleobacter_necessarius_asymbioticus_QLW-P1DM

WA-1

Burkholderia_pseudomallei_Pakistan_9



Taylorella_equigenitalis_ATCC_35865

Burkholderia

_multiv

orans_DDS_15

A-1

Pandoraea_sp._SD6-2

Acidovorax_citrulli_tw6

Burk

hold

eria_

sp._C

h1-1

Burkholderia_thailandensis_MSMB43

Bordetella_holmesii_ATCC_51541

Taylorella_asinigenitalis_14/45

Thiomonas_sp._FB-C

d,_DSM

_25617

Burkholderia_mallei_SAVP1

Burkh

older

ia_sp

._A1

Burk

hold

eria

_phy

mat

um_S

TM81

5

Alicycliphilus_denitrificans_K601

Met

hylib

ium

_sp.

_CF0

59


Janthinobacterium_sp._OK676

Achromobacter_xylosoxidans_NBRC_15126,_ATCC_27061


Bordetella_petrii_Se-1111R,_DSM_12804

Bordetella_holmesii_H

558

Delftia_sp._RIT313

Herbaspirillum_huttiense_putei_IAM_15032

Curvib

acte

r_gr

acilis

_ATCC_B

AA-807

Lautropia_mirabilis_A

TCC

_51599

Acidovorax_citrulli_DSM_17060


Burkholderia_vietnamiensis_WPB

Burkholderia_sp._KJ006

Pseu

dorh

odof

erax

_sp.

_Lea

f274

Burkholderia_pseudomallei_BEO

Burkholderia_pseudomallei_1710a

Burk

hold

eria

_fer

raria

e_N

BRC_

1062

33

Polar

omon

as_g

lacial

is_R3

-9

Ralstonia_solanacearum_SD

54

Comamonas_composti_DSM_21721


Burkholderi

a_gla

dioli_NBRC_13

700

Massilia_niastensis_DSM_21313

Acidovorax_sp._Root267

Duganella_sp._CF402

Comamonas_testosteroni_KF-1

Polynucleobacter_sp._MW

H-MoK4

Burk

hold

eria_

xeno

vora

ns_L

B400


Pusillimonas_sp._T7-7

Burkholderia_sp._B14


Burkholderia_pyrrocinia_Lyc2

Giesbergeria_anulus_ATCC_35958

Herbaspirillum_sp._GW103Collimonas_sp._OK412

Bordetella_bronchiseptica_B18-5_(C3)

Polar

omon

as_s

p._CF3

18

Variovorax_paradoxus_295MFChir4

.1

Variovorax_sp

._Root473


Burkholderia_pseudomallei_MSHR5848Acidovorax_avenae_citrulli_AAC00-1

Duganella_sp._Root1480D1

Pandoraea_pnomenusa_3kgm

Acidovorax_wautersii_DSM_27981

Achromobacter_arsenitoxydans_SY8

Met

hylib

ium

_sp.

_YR6

05

Burkholderia_pseudomallei_K96243

Curvib

acte

r_lan

ceola

tus_

ATCC_14669

Acidovorax_citrulli_pslbtw65

Variov

orax

_para

doxus

_110B

Brac

hym

onas

_den

itrifi

cans

_DSM

_151

23

Burkholderia_pseudomallei_S13

Ram

libac

ter_

tata

ouin

ensis

_TTB

310

Burkholderia_mallei_NCTC_10229

Brac

hym

onas

_chi

rono

mi_

DSM

_198

84

Oxalobacter_formigenes_HOxBLS


Burk

hold

eria

_mim

osar

um_L

MG

_232

56

Massilia_lurida_CGMCC_1.10822

Herminiimonas_arsenicoxydans_ULPAs1

Janthinobacterium_sp._HH01

Burk

hold

eria

_sp.

_JC2

948

Burk

hold

eria

_gin

seng

isoli_

NBRC

_100

965

Oligella_ureolytica_DSM_18253

Burkholderia_pseudomallei_Pasteur


Burk

hold

eria_

phyt

ofirm

ans_

PsJN


Burk

hold

eria

_sp.

_YR2

81

Simplicispira_psychrophila_DSM_11588

Burkholderia_mallei_23344

Burk

holde

ria_s

p._JP

Y366

Burk

hold

eria

_mim

osar

um_N

BRC_

1063

38

Burkholderia

_dolosa

_AUO158

Burk

hold

eria_

dilw

orth

ii_W

SM35

56

Duganella_sp._Root336D2

Variovo

rax_sp

._Root4

11

Acidovorax_temperans_KY4

Comamonas_thiooxydans_DSM_17888

Bordetella_parapertussis_12822

Burkholderia_pyrro

cinia_CH-67

Chitinimonas_taiwanensis_DSM_18899

Ralstonia_solanacearum_U

W551

Variovo

rax_s

p._375M

FSha3

.1

Variov

orax

_sp._

URHB0020

Achromobacter_xylosoxidans_A8

Burkholderia_mallei_BMK

Cupriavidus_metallidurans_H1130

Burkholderia_cenocepacia_AU_1054

Collimonas_fungivorans_Ter331

Delftia_acidovorans_CCUG_274B

Burkholderia_cenocepacia_BC7

Achromobacter_piechaudii_ATCC_43553

Alcaligenes_faecalis_phenolicus_DSM_16503

Cupriavidus_sp._OV096

Ralstonia_sp._GA3-3

Herbaspirillum_frisingense_GSF30

Variovorax_sp

._EL159

Massilia_flava_CGMCC_1.10685

Burk

hold

eria

_hel

eia_

NBR

C_10

1817

Burk

hold

eria

_mim

osar

um_S

TM36

21

Duganella_sp._OV458

Burk

hold

eria_

caled

onica

_NBR

C_102

488

Extensimonas_vulgaris_CGMCC_1.10977

Burk

holde

ria_g

ram

inis_

C4D1M

Brackiella_oedipodis_DSM_13743

Ralstonia_pickettii_12J

Burkholderia_pseudomallei_1106b

Acidovorax_temperans_CB2

Limnobacter_sp._MED105 Acidovorax_sp._JS42

Polar

omon

as_s

p._JS

666_

UNC47MFT

su3.1

Hyd

roge

noph

aga_

sp._

PBC

Bordetella_holmesii_F627

Advenella_kashmirensis_W13003

Rose

atel

es_s

p._Y

R242

Alcaligenes_sp._EGD-AK7

Burkholderia_mallei_ATCC_23344

Comamonas_aquatica_DA1877

Burkholderia_cepacia_UCB_717,_ATCC_25416

Bordetella_bronchiseptica_RB50

Delftia_acidovorans_2167

Bordetella_hinzii_ATCC_51730


Oxalobacter_formigenes_OXCC13

Burk

hold

eria

_sp.

_YI2

3

Variovo

rax_s

p._278M

FTsu

5.1

Burkh

older

ia_ph

enoli

rupt

rix_A

C1100

Ralstonia_sp._5_2_56FAA

Ralstonia_eutropha_H16

Herbaspirillum_sp._CF444


Burkholderia_cepacia_RB-39

Burkholderia_pseudomallei_406e

Burk

hold

eria

_sp.

_SJ9

8


Ralstonia_sp._PBA

beta_proteobacterium_JGI_0001001-A11

Burkholderia_thailandensis_E264,_ATCC_700388

Variovorax_paradoxus_EPSVari

ovorax

_sp._1

60MFS

ha2.1

Janthinobacterium_lividum_PAMC_25724


Achromobacter_xylosoxidans_X02736

Ralstonia_solanacearum_P673

Cupriavidus_metallidurans_CH34

Burk

hold

eria_

sp._U

RHA00

54

Ralstonia_sp._JGI_0001001-A05

Rhod

ofer

ax_f

errir

educ

ens_

T118


Janthinobacterium_lividum_RIT308

Burkholderia_cenocepacia_PC184

Acidovorax_sp._JHL-9

Com

amon

adac

eae_

bact

eriu

m_H

1

Acidovorax_sp._MR-S7

Bordetella_bronchiseptica_C4

Burk

holde

ria_s

p._W

SM22

32

Burk

hold

eria

_ter

rae_

BS00

1

Burkholderia_cenocepacia_HI2424

Massilia_consociata_BSC265

Ralstonia_sp._UN

CCL144

FliOFliN

C-term

FlgB

FliFFlgC

FliH

FlgDFlgE

FliPFliA

FlgA

FlgI

Methylibium sp. YR605

Burkholderia sp. Msh1

Ralstonia solanacearum P673

Spearman correlation

PA orthogroups, Burkholderiales (Scoary)

PA o

rtho

grou

ps, B

urkh

olde

riale

s (S

coar

y)

Burkholderiales

a

b

Clustered orthogroups1

0.8

0.6

0.4

0.2

0

–0.2

–0.4

–0.6

–0.8

–1

Flagellar genes

Achromobacter

Roseateles, RhizobacterMethylibium

Variovorax

Delftia

Acidovorax

Collimonas

Massilia

Duganella

Ralstonia

Burkholderia

Not a significant PA gene(Scoary)

Thiomonas_bhubanesw

arensis_DSM

_18181Thiom

onas_intermedia_K12

Thiomonas_arsenitoxydans_3A

sC

aldimonas_m

anganoxidans_ATC

C_BA

A-369 Le

ptot

hrix

_cho

lodn

ii_SP

-6

Burk

hold

eria

les_

sp._

JOSH

I_00

1Ru

briv

ivax

_gel

atin

osus

_IL

144

Azo

hrdr

omon

as_a

ustr

alic

a_D

SM_1

124

Burkholderia_rhizoxinica_HKI_454

Burkholderia_sp.B13

Burk

hold

eria

_sp.

_JPY

347

Burk

hold

eria

_sp.

_YR5

20Bu

rkho

lder

ia_s

p._L

eaf1

77

Burk

hold

eria

_sor

didi

cola

_S17

0Bu

rkho

lder

ia_g

rimm

iae_

R27

Burkholderia_pseudomallei_354e

Fig. 5 | Co-occurring plant-associated and soil-associated flagellum-like gene clusters are sporadically distributed across Burkholderiales. a, Left, a hierarchically clustered correlation matrix of all 202 significant plant-associated (PA) orthogroups (gene clusters) from Burkholderiales, predicted by Scoary. Right, the orthogroups present within and adjacent to the flagellar-like locus of different genomes. Gene names based on a BLAST search are shown in parentheses. Hyp., hypothetical protein; RHS, RHS repeat protein. Genes illustrated above and below the black horizontal line for each species are located on the positive and negative strand, respectively. b, The Burkholderiales phylogenetic tree based on the concatenated alignment of 31 single-copy genes. Colored circles represent the 11 orthogroups presented in a, with the same color-coding as in a. Genus names are shown next to pillars of stacked circles. RA, root-associated.





A. caeni DSM 19327 (0,0,0)

Comamonas testosteroni ATCC 11996 (0,0,0)

A. sp. JS42 (0,0,0)

A. sp. MR-S7 (0,0,0)

A. ebreus TPSY (0,0,0)

A. temperans CB2 (0,0,0)

A. temperans KY4 (0,0,0)

A. sp. JHL-3 (0,0,0)

A. sp. JHL-9 (0,0,0)

Pathogen

Non-pathogen

PANPASoil

ba

A. sp. NO-1 (1,0,0)

A. sp. Root70 (10,0,0)

A. sp. KKS102 (10,0,0)

A. delafieldii DSM 64 (13,0,0)

A. sp. Root402 (10,0,0)

A. sp. Root568 (9,0,0)

A. sp. Leaf78 (10,0,0)

A. sp. Leaf76 (11,0,1)

A. sp. Leaf84 (12,0,2)

A. radicis N35 (13,0,0)

A. sp. Root267 (9,0,0)

A. sp. Root219 (7,0,0)

A. sp. Root217 (12,0,0)

A. avenae citrulli AAC00-1 (0,11,3)

A. citrulli DSM 17060 (0,15,4)

A. citrulli tw6 (0,15,3)

A. citrulli pslbtw65 (0,15,5)

A. oryzae ATCC 19882 (0,9,3)

A. avenae avenae RS-1 (0,15,8)A. avenaeT10_61 (0,11,6) A. avenae avenaeATCC 19860 (0,11,5)A. konjaci DSM 7481 (0,5,4)

A. valerianellae DSM 16619 (0,3,4)

A. sp. Leaf160 (0,0,0)

A. wautersii DSM 27981 (0,0,0)

d

100

100

100

77

100

100

100

100

100

100

Hyde2 Transposase

cHyde1

Jekyll

PAARHyde1-like Hyde2 Transposase

Pathogens

Non-pathogensand soil

Acidovoraxtree branches

Genomeclassification

Gene copynumber

(Jekyll, Hyde1, Hyde2)

Bacteriallifestyle

RNase G

Malate synthase

Acidovorax sp. Leaf191




Acidovorax avenae subsp. citrulli AAC00-1, watermelon

Acidovorax citrulli tw6, watermelon

Acidovorax avenae T10_61, sugarcane

Acidovorax avenae avenae RS-1, rice

Acidovorax oryzae ATCC 19882, rice

P. syringae pv. actinidia ICMP 18883,kiwifruit

P. syringae pv. tomato DC3000, tomato

P. syringae ICMP 18804,kiwifruit

P. syringae Pph1448A,common bean

P. syringae PmaM6 (Psy38),cauliflower

P. syringae pv. tomato Max13,tomato

P. syringae pv. syringae B728a, green bean

Fig. 6 | Rapidly diversifying, high-copy-number Jekyll and Hyde plant-associated genes. a, A maximum likelihood phylogenetic tree of Acidovorax isolates based on concatenation of 35 single-copy genes. The pathogenic and non-pathogenic branches of the tree are perfectly correlated with the presence of Hyde1 and Jekyll genes, respectively. b, An example of a variable Jekyll locus in highly related Acidovorax species isolated from leaves of wild Arabidopsis from Brugg, Switzerland. Arrows indicate the following locus tags (from top to bottom): Ga0102403_10161, Ga0102306_101276, Ga0102307_107159, and Ga0102310_10161. c, An example of a variable Hyde locus from pathogenic Acidovorax infecting different plants (the host plant is shown after the species name). The transposase in the first operon fragmented a Hyde2 gene. Arrows indicate the following locus tags (from top to bottom): Aave_3195, Ga0078621_123525, Ga0098809_1087148, T336DRAFT_00345, and AASARDRAFT_03920. d, An example of a variable Hyde locus from pathogenic Pseudomonas syringae infecting different plants. Arrows indicate the following locus tags (from top to bottom): PSPTOimg_00004880 (a.k.a. PSPTO_0475), A243_06583, NZ4DRAFT_02530, Pphimg_00049570, PmaM6_0066.00000100, PsyrptM_010100007142, and Psyr_4701. Genes color-coded with the same colors in b–d are homologous, with the exception of genes colored in ivory (unannotated genes) and Hyde1 and Hyde1-like genes, which are analogous in terms of their similar size, high diversification rate, position downstream of Hyde2, and tendency to have a transmembrane domain. PAAR, proline-alanine-alanine-arginine repeat superfamily.





short length and a transmembrane helix (Supplementary Fig. 26d). Hyde-carrying organisms included other phytopathogens, such as Pseudomonas syringae, in which the Hyde1-like-Hyde2 locus was again highly variable between closely related strains (Fig. 6d, Supplementary Table 23c). However, the striking Hyde genomic expansion was specific to the phytopathogenic Acidovorax lineage (Supplementary Table 23e). Notably, we observed that Hyde genes

often are directly preceded by genes that encode core structural T6SS proteins, such as PAAR, VgrG, and Hcp65, or are fused to PAAR (Fig. 6d, Supplementary Fig. 27a,b, Supplementary Table 23e). We therefore suggest that Hyde1 and/or Hyde2 might constitute a new T6SS effector family.

The high sequence diversity of Jekyll and Hyde1 genes suggests that the two plant-associated protein families encoded by these

a

b

log 10

CFU

/ml

E. coli +

Hyde2

(Aave

_0990)

E. coli +

Hyde1

(Aave

_0989)

E. coli +

Hyde1

(Aave

_319

1)

3

4

5

6

7

8

9

+ 0.5 mM IPTG

No IPTG

P = 0.0002

**

**

**

**

**

**

**–8

–6

–4

–2

0

2

Prey strains

log 10

(CFU

pre y

/CFU

neg)

AAC00-1 wt

AAC00-1 ∆T6SS

AAC00-1 ∆5-Hyde1

P = 2 × 10–7

P = 4 × 10–8

0.020.02

0.0050.003

0.0010.001

0.0460.042

0.0090.01

0.00010.0001

0.00070.001

E. coli +

GFP

L2 Novo

sphingobium

L11 Sphingomonas

L58 Pseu

domonas

L70 Sten

otrophomonas

L82 Fla

vobacte

rium

L98 Pseu

domonas

L130 Acin

etobacte

r

L405 Chrys

eobacte

rium

L434 Pseu

domonas

E. coli B

W25113

Fig. 7 | Hyde1 proteins of Acidovorax citrulli AAC00-1 are toxic to E. coli and various plant-associated bacterial strains. a, Toxicity assay of Hyde proteins expressed in E. coli. GFP, Hyde2-Aave_0990, and two Hyde1 genes from two loci, Aave_0989 and Aave_3191, were cloned into pET28b and transformed into E. coli C41 cells. Aave_0989 and Aave_3191 proteins were 53% identical. Bacterial cultures from five independent colonies were spotted on an LB plate. Gene expression of the cloned genes was induced with 0.5 mM IPTG. P values are shown for significant results (two-sided t-test). b, Quantification of recovered prey cells after coincubation with Acidovorax aggressor strains. Antibiotic-resistant prey strains E. coli BW25113 and nine different Arabidopsis leaf isolates were mixed at equal ratios with different aggressor strains or with NB medium (negative control). Five Hyde1 loci (including 9 out of 11 Hyde1 genes) are deleted in ∆ 5-Hyde1. ∆ T6SS contains a vasD (Aave_1470) deletion. After coincubation for 19 h on NB agar plates, mixed populations were resuspended in NB medium and spotted on selective antibiotic-containing NB agar. The box plots represent results from at least three independent experiments, with individual values superimposed as dots. The center line represents the median, the box limits represent the 25th and 75th percentiles, and the edges represent the minimal and maximal values. P values are shown at the top; double asterisks denote a significant difference (one-way ANOVA followed by Tukey’s honest significant difference test) between results for wild type versus ∆ T6SS and for wild type versus ∆ 5-Hyde1. Full strain names and statistical information are presented in Supplementary Table 25. For a time course experiment with exemplary strains, see Supplementary Fig. 29.





genes could be involved in molecular arms races with other organ-isms in the plant environment. As many type VI effectors are used in interbacterial warfare, we tested Acidovorax Hyde1 proteins for antibacterial properties. Expression of two variants of the gene in Escherichia coli led to a 105–106-fold reduction in cell numbers (Fig. 7a, Supplementary Table 25). We constructed a mutant strain of the phytopathogen Acidovorax citrulli AAC00-1with deletion of five Hyde1 loci (∆ 5-Hyde1), encompassing 9 of 11 Hyde1 genes (Supplementary Fig. 28, Supplementary Table 25). Wild-type, ∆ 5-Hyde1, and T6SS-mutant (∆ T6SS) Acidovorax strains were coincubated with an E. coli strain that is susceptible to T6SS kill-ing66 and nine phylogenetically diverse Arabidopsis leaf bacterial isolates16. Survival of wild-type E. coli and six of the leaf isolates after coincubation with wild-type Acidovorax was reduced 102–106-fold compared with that after coincubation with ∆ 5-Hyde1 or ∆ T6SS Acidovorax (Fig. 7b, Supplementary Fig. 29, Supplementary Table 25). Combined with the genomic association of Hyde loci with T6SS, these results suggest that the T6SS antibacterial phe-notype of Acidovorax is mediated by Hyde proteins and that these toxins could be used in competition against other plant-associated organisms. Consistent with a function in microbe–microbe inter-actions, we did not detect compromised virulence of the ∆ 5-Hyde1 strain on host plants (watermelon; data not shown). However, clearance of competitors via T6SS can promote the persistence of Acidovorax citrulli on its host67.

DiscussionThere is increasing awareness that plant-associated microbial communities have important roles in host growth and health. An understanding of plant–microbe relationships at the genomic level could enable scientists to use microbes to enhance agricultural productivity. Most studies have focused on specific plant micro-biomes, with more emphasis on microbial diversity than on gene function12,14,16,18,68–74. Here we sequenced nearly 500 root-associated bacterial genomes isolated from different plant hosts. These new genomes were combined in a collection of 3,837 high-quality bacte-rial genomes for comparative analysis. We developed a systematic approach to identify plant-associated and root-associated genes and putative operons. Our method is accurate as reflected by its abil-ity to capture numerous operons previously shown to have a plant-associated function, the enrichment of plant-associated genes in plant-associated metagenomes, the validation of Hyde1 proteins as likely type VI effectors in Acidovorax directed against other plant-associated bacteria, and the validation of two new genes in P. kuru-riensis that affect rice root colonization. We note that bacterial genes that are enriched in genomes from the plant environment are also likely to be involved in adaptation to the many other organisms that share the same niche, as we demonstrated for Hyde1.

We used five different statistical approaches to identify genes that were significantly associated with the plant/root environment, each with its advantages and disadvantages. The phylogeny-correcting approaches (phyloglmbin, phyloglmcn, and Scoary) allow accurate identification of genes that are polyphyletic and correlate with an environment independently of ancestral state. On the basis of our metagenome validation, the hypergeometric test predicts more genes that are abundant in plant-associated communities than PhyloGLM does. It also identifies monophyletic plant-associated genes, but it yields more false positives than the phylogenetic tests, because in every plant-associated lineage many lineage-specific genes will be considered plant-associated. Scoary is the most stringent method of all and yielded the fewest predictions (Supplementary Table 18). Future experimental validation should prioritize genes predicted in multiple taxa and/or by multiple approaches (Supplementary Figs. 5 and 6, Supplementary Tables 20 and 26).

We discovered 64 PREPARADOs. Proteins containing 19 of these domains are predicted to be secreted by the Sec or T3SS protein

secretion systems (Supplementary Table 21). Notably, plant proteins carrying 35 of these domains belonged to the NLR class of intracellu-lar innate immune receptors (Supplementary Fig. 23, Supplementary Table 21). Thus, these PREPARADO protein domains may serve as molecular mimics. Some may interfere with plant immune functions through disruption of key plant protein interactions75,76. Likewise, the Jacalin-containing proteins we detected in plant-associated bacteria, fungi, and oomycetes may represent a strategy of avoiding immunity triggered by microbial-associated molecular patterns, by binding to extracellular microbial mannose molecules and thereby serving as a molecular invisibility cloak77,78.

Finally, we demonstrated that numerous plant-associated func-tions are consistent across phylogenetically diverse bacterial taxa, and that some functions are even shared with plant-associated eukaryotes. Some of these traits may facilitate plant colonization by microbes and therefore might prove useful in genome engineering of agricultural inoculants to eventually yield a more efficient and sustainable agriculture.

URLs. iTOL Interactive tree (Fig. 1a), https://itol.embl.de/tree/15223230182273621508772620; datasets at the Dangl lab’s dedicated website, http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/index.html (Dataset 1, FNA—nucleotide FASTA files of the 3,837 genomes; Dataset 2, FAA—FASTA files of all proteins used in the analysis; Dataset 3, COG/KEGG Orthology/Pfam/TIGRFAM IMG annotations of all genes used in analysis; Dataset 4, metadata of all genomes; Dataset 5, phylogenetic trees of each of the nine taxa; Dataset 6, pangenome matrices; Dataset 7, pangeneome data frames; Dataset 8, OrthoFinder orthogroup FASTA files; Dataset 9, Mafft MSA of all orthogroups; Dataset 10, hidden Markov models of all orthogroups; Dataset 11, plant-associated/NPA and root-asso-ciated/soil-associated enrichment tables; Dataset 12, correlation matrices; Dataset 13, predicted operons); DSMZ, https://www.dsmz.de/; ATCC, https://www.atcc.org/; NCBI Biosample, https://www.ncbi.nlm.nih.gov/biosample/; IMG, https://img.jgi.doe.gov/cgi-bin/mer/main.cgi; GOLD, https://gold.jgi.doe.gov/; Phytozome, https://phytozome.jgi.doe.gov/pz/portal.html; BrassicaDB, http://brassicadb.org/brad/; sm R package, http://www.stats.gla.ac.uk/~adrian/sm; vegan R package, https://cran.r-project.org/web/packages/vegan/index.html; ape R package, https://cran.r-project.org/web/packages/ape/ape.pdf; fpc R package, https://cran.r-proj-ect.org/web/packages/fpc/index.html; phylolmR package, https://cran.r-project.org/web/packages/phylolm/index.html; scripts used to compute the orthogroups, https://github.com/isaisg/gfobap/tree/master/orthofinder_diamond; scripts used to run the gene enrich-ment tests, https://github.com/isaisg/gfobap/tree/master/enrich-ment_tests; scripts used to perform the PCoA, https://github.com/isaisg/gfobap/tree/master/pcoa_visualization_ogs_enriched.

MethodsMethods, including statements of data availability and any asso-ciated accession codes and references, are available at https://doi.org/10.1038/s41588-017-0012-9.

Received: 20 January 2017; Accepted: 10 November 2017; Published: xx xx xxxx

References 1. Ley, R. E. et al. Evolution of mammals and their gut microbes. Science 320,

1647–1651 (2008). 2. Baumann, P. Biology bacteriocyte-associated endosymbionts of plant

sap-sucking insects. Annu. Rev. Microbiol. 59, 155–189 (2005). 3. Sprent, J. I. 60Ma of legume nodulation. What’s new? What’s changing?

J. Exp. Bot. 59, 1081–1084 (2008). 4. Pfeilmeier, S., Caly, D. L. & Malone, J. G. Bacterial pathogenesis

of plants: future challenges from a microbial perspective: Challenges in Bacterial Molecular Plant Pathology. Mol. Plant Pathol. 17, 1298–1313 (2016).



https://itol.embl.de/tree/15223230182273621508772620

https://itol.embl.de/tree/15223230182273621508772620

http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/index.html


https://www.dsmz.de/

https://www.dsmz.de/

https://www.atcc.org/

https://www.ncbi.nlm.nih.gov/biosample/

https://www.ncbi.nlm.nih.gov/biosample/

https://img.jgi.doe.gov/cgi-bin/mer/main.cgi

https://img.jgi.doe.gov/cgi-bin/mer/main.cgi

https://gold.jgi.doe.gov/

https://phytozome.jgi.doe.gov/pz/portal.html

http://brassicadb.org/brad/

http://www.stats.gla.ac.uk/~adrian/sm

http://www.stats.gla.ac.uk/~adrian/sm

https://cran.r-project.org/web/packages/vegan/index.html

https://cran.r-project.org/web/packages/vegan/index.html

https://cran.r-project.org/web/packages/ape/ape.pdf

https://cran.r-project.org/web/packages/ape/ape.pdf

https://cran.r-project.org/web/packages/fpc/index.html

https://cran.r-project.org/web/packages/fpc/index.html

https://cran.r-project.org/web/packages/phylolm/index.html

https://cran.r-project.org/web/packages/phylolm/index.html

https://github.com/isaisg/gfobap/tree/master/orthofinder_diamond

https://github.com/isaisg/gfobap/tree/master/orthofinder_diamond

https://github.com/isaisg/gfobap/tree/master/enrichment_tests

https://github.com/isaisg/gfobap/tree/master/enrichment_tests

https://github.com/isaisg/gfobap/tree/master/pcoa_visualization_ogs_enriched

https://github.com/isaisg/gfobap/tree/master/pcoa_visualization_ogs_enriched

https://doi.org/10.1038/s41588-017-0012-9

https://doi.org/10.1038/s41588-017-0012-9



5. Chowdhury, S. P., Hartmann, A., Gao, X. & Borriss, R. Biocontrol mechanism by root-associated Bacillus amyloliquefaciens FZB42—a review. Front. Microbiol. 6, 780 (2015).

6. Fibach-Paldi, S., Burdman, S. & Okon, Y. Key physiological properties contributing to rhizosphere adaptation and plant growth promotion abilities of Azospirillum brasilense. FEMS Microbiol. Lett. 326, 99–108 (2012).

7. Santhanam, R. et al. Native root-associated bacteria rescue a plant from a sudden-wilt disease that emerged during continuous cropping. Proc. Natl. Acad. Sci. USA 112, E5013–E5020 (2015).

8. Peters, N. K., Frost, J. W. & Long, S. R. A plant flavone, luteolin, induces expression of Rhizobium meliloti nodulation genes. Science 233, 977–980 (1986).

9. Hiei, Y., Ohta, S., Komari, T. & Kumashiro, T. Efficient transformation of rice (Oryza sativa L.) mediated by Agrobacterium and sequence analysis of the boundaries of the T-DNA. Plant J. 6, 271–282 (1994).

10. Hueck, C. J. Type III protein secretion systems in bacterial pathogens of animals and plants. Microbiol. Mol. Biol. Rev. 62, 379–433 (1998).

11. Bulgarelli, D. et al. Revealing structure and assembly cues for Arabidopsis root-inhabiting bacterial microbiota. Nature 488, 91–95 (2012).

12. Lundberg, D. S. et al. Defining the core Arabidopsis thaliana root microbiome. Nature 488, 86–90 (2012).

13. Bulgarelli, D., Schlaeppi, K., Spaepen, S., Ver Loren van Themaat, E. & Schulze-Lefert, P. Structure and functions of the bacterial microbiota of plants. Annu. Rev. Plant Biol. 64, 807–838 (2013).

14. Ofek-Lalzar, M. et al. Niche and host-associated functional signatures of the root surface microbiome. Nat. Commun. 5, 4950 (2014).

15. Gottel, N. R. et al. Distinct microbial communities within the endosphere and rhizosphere of Populus deltoides roots across contrasting soil types. Appl. Environ. Microbiol. 77, 5934–5944 (2011).

16. Bai, Y. et al. Functional overlap of the Arabidopsis leaf and root microbiota. Nature 528, 364–369 (2015).

17. Hardoim, P. R. et al. The hidden world within plants: ecological and evolutionary considerations for defining functioning of microbial endophytes. Microbiol. Mol. Biol. Rev. 79, 293–320 (2015).

18. Bulgarelli, D. et al. Structure and function of the bacterial root microbiota in wild and domesticated barley. Cell Host Microbe 17, 392–403 (2015).

19. Hacquard, S. et al. Microbiota and host nutrition across plant and animal kingdoms. Cell Host Microbe 17, 603–616 (2015).

20. Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).

21. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).

22. Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).

23. Huntemann, M. et al. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4). Stand. Genomic Sci. 10, 86 (2015).

24. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).

25. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).

26. Ives, A. R. & Garland, T. Jr. Phylogenetic logistic regression for binary dependent variables. Syst. Biol. 59, 9–26 (2010).

27. Brynildsrud, O., Bohlin, J., Scheffer, L. & Eldholm, V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17, 238 (2016).

28. Hultman, J. et al. Multi-omics of permafrost, active layer and thermokarst bog soil microbiomes. Nature 521, 208–212 (2015).

29. Louca, S. et al. Integrating biogeochemistry with multiomic sequence information in a model oxygen minimum zone. Proc. Natl. Acad. Sci. USA 113, E5925–E5933 (2016).

30. Coutinho, B. G., Licastro, D., Mendonça-Previato, L., Cámara, M. & Venturi, V. Plant-influenced gene expression in the rice endophyte Burkholderia kururiensis M130. Mol. Plant Microbe Interact. 28, 10–21 (2015).

31. Long, S. R. Rhizobium-legume nodulation: life together in the underground. Cell 56, 203–214 (1989).

32. Ruvkun, G. B., Sundaresan, V. & Ausubel, F. M. Directed transposon Tn5 mutagenesis and complementation analysis of Rhizobium meliloti symbiotic nitrogen fixation genes. Cell 29, 551–559 (1982).

33. Hershey, D. M., Lu, X., Zi, J. & Peters, R. J. Functional conservation of the capacity for ent-kaurene biosynthesis and an associated operon in certain rhizobia. J. Bacteriol. 196, 100–106 (2014).

34. Nett, R. S. et al. Elucidation of gibberellin biosynthesis in bacteria reveals convergent evolution. Nat. Chem. Biol. 13, 69–74 (2017).

35. Scharf, B. E., Hynes, M. F. & Alexandre, G. M. Chemotaxis signaling systems in model beneficial plant-bacteria associations. Plant Mol. Biol. 90, 549–559 (2016).

36. Büttner, D. & He, S. Y. Type III protein secretion in plant pathogenic bacteria. Plant Physiol. 150, 1656–1664 (2009).

37. Gao, R. et al. Genome-wide RNA sequencing analysis of quorum sensing-controlled regulons in the plant-associated Burkholderia glumae PG1 strain. Appl. Environ. Microbiol. 81, 7993–8007 (2015).

38. Weller-Stuart, T., Toth, I., De Maayer, P. & Coutinho, T. Swimming and twitching motility are essential for attachment and virulence of Pantoea ananatis in onion seedlings. Mol. Plant Pathol. 18, 734–745 (2017).

39. De Weger, L. A. et al. Flagella of a plant-growth-stimulating Pseudomonas fluorescens strain are required for colonization of potato roots. J. Bacteriol. 169, 2769–2773 (1987).

40. de Weert, S. et al. Flagella-driven chemotaxis towards exudate components is an important trait for tomato root colonization by Pseudomonas fluorescens. Mol. Plant Microbe Interact. 15, 1173–1180 (2002).

41. Ravcheev, D. A. et al. Comparative genomics and evolution of regulons of the LacI-family transcription factors. Front. Microbiol. 5, 294 (2014).

42. Yamauchi, Y., Hasegawa, A., Taninaka, A., Mizutani, M. & Sugimoto, Y. NADPH-dependent reductases involved in the detoxification of reactive carbonyls in plants. J. Biol. Chem. 286, 6999–7009 (2011).

43. Burstein, D. et al. Genome-scale identification of Legionella pneumophila effectors using a machine learning approach. PLoS Pathog. 5, e1000508 (2009).

44. Dean, P. Functional domains and motifs of bacterial type III effector proteins and their roles in infection. FEMS Microbiol. Rev. 35, 1100–1125 (2011).

45. Stebbins, C. E. & Galán, J. E. Structural mimicry in bacterial virulence. Nature 412, 701–705 (2001).

46. Price, C. T. et al. Molecular mimicry by an F-box effector of Legionella pneumophila hijacks a conserved polyubiquitination machinery within macrophages and protozoa. PLoS Pathog. 5, e1000704 (2009).

47. Rothmeier, E. et al. Activation of Ran GTPase by a Legionella effector promotes microtubule polymerization, pathogen vacuole motility and infection. PLoS Pathog. 9, e1003598 (2013).

48. Xu, R.-Q. et al. AvrAC(Xcc8004), a type III effector with a leucine-rich repeat domain from Xanthomonas campestris pathovar campestris confers avirulence in vascular tissues of Arabidopsis thaliana ecotype Col-0. J. Bacteriol. 190, 343–355 (2008).

49. Shevchik, V. E., Robert-Baudouy, J. & Hugouvieux-Cotte-Pattat, N. Pectate lyase PelI of Erwinia chrysanthemi 3937 belongs to a new family. J. Bacteriol. 179, 7321–7330 (1997).

50. Cesari, S., Bernoux, M., Moncuquet, P., Kroj, T. & Dodds, P. N. A novel conserved mechanism for plant NLR protein pairs: the “integrated decoy” hypothesis. Front. Plant Sci. 5, 606 (2014).

51. Sarris, P. F. et al. A plant immune receptor detects pathogen effectors that target WRKY transcription factors. Cell 161, 1089–1100 (2015).

52. Sarris, P. F., Cevik, V., Dagdas, G., Jones, J. D. & Krasileva, K. V. Comparative analysis of plant immune receptor architectures uncovers host proteins likely targeted by pathogens. BMC Biol. 14, 8 (2016).

53. Le Roux, C. et al. A receptor pair with an integrated decoy converts pathogen disabling of transcription factors to immunity. Cell 161, 1074–1088 (2015).

54. Brown, G. D. & Netea, M. G. (eds.). Immunology of Fungal Infections. (Springer, Dordrecht, The Netherlands, 2007).

55. Gadjeva, M., Takahashi, K. & Thiel, S. Mannan-binding lectin—a soluble pattern recognition molecule. Mol. Immunol. 41, 113–121 (2004).

56. Ma, Q.-H., Tian, B. & Li, Y.-L. Overexpression of a wheat jasmonate-regulated lectin increases pathogen resistance. Biochimie 92, 187–193 (2010).

57. Xiang, Y. et al. A jacalin-related lectin-like gene in wheat is a component of the plant defence system. J. Exp. Bot. 62, 5471–5483 (2011).

58. Yamaji, Y. et al. Lectin-mediated resistance impairs plant virus infection at the cellular level. Plant Cell 24, 778–793 (2012).

59. Weidenbach, D. et al. Polarized defense against fungal pathogens is mediated by the Jacalin-related lectin domain of modular Poaceae-specific proteins. Mol. Plant 9, 514–527 (2016).

60. Sahly, H. et al. Surfactant protein D binds selectively to Klebsiella pneumoniae lipopolysaccharides containing mannose-rich O-antigens. J. Immunol. 169, 3267–3274 (2002).

61. Osborn, M. J., Rosen, S. M., Rothfield, L., Zeleznick, L. D. & Horecker, B. L. Lipopolysaccharide of the gram-negative cell wall. Science 145, 783–789 (1964).

62. Tans-Kersten, J., Huang, H. & Allen, C. Ralstonia solanacearum needs motility for invasive virulence on tomato. J. Bacteriol. 183, 3597–3605 (2001).

63. Cole, B. J. et al. Genome-wide identification of bacterial plant colonization genes. PLoS Biol. 15, e2002860 (2017).

64. Poggio, S. et al. A complete set of flagellar genes acquired by horizontal transfer coexists with the endogenous flagellar system in Rhodobacter sphaeroides. J. Bacteriol. 189, 3208–3216 (2007).





65. Ho, B. T., Dong, T. G. & Mekalanos, J. J. A view to a kill: the bacterial type VI secretion system. Cell Host Microbe 15, 9–21 (2014).

66. MacIntyre, D. L., Miyata, S. T., Kitaoka, M. & Pukatzki, S. The Vibrio cholerae type VI secretion system displays antimicrobial properties. Proc. Natl. Acad. Sci. USA 107, 19520–19524 (2010).

67. Tian, Y. et al. The type VI protein secretion system contributes to biofilm formation and seed-to-seedling transmission of Acidovorax citrulli on melon. Mol. Plant Pathol. 16, 38–47 (2015).

68. Peiffer, J. A. et al. Diversity and heritability of the maize rhizosphere microbiome under field conditions. Proc. Natl. Acad. Sci. USA 110, 6548–6553 (2013).

69. Agler, M. T. et al. Microbial hub taxa link host and abiotic factors to plant microbiome variation. PLoS Biol. 14, e1002352 (2016).

70. Bokulich, N. A., Thorngate, J. H., Richardson, P. M. & Mills, D. A. Microbial biogeography of wine grapes is conditioned by cultivar, vintage, and climate. Proc. Natl. Acad. Sci. USA 111, E139–E148 (2014).

71. Coleman-Derr, D. et al. Plant compartment and biogeography affect microbiome composition in cultivated and native Agave species. New Phytol. 209, 798–811 (2016).

72. Shade, A., McManus, P. S. & Handelsman, J. Unexpected diversity during community succession in the apple flower microbiome. MBio 4, e00602–e00612 (2013).

73. Turner, T. R. et al. Comparative metatranscriptomics reveals kingdom level changes in the rhizosphere microbiome of plants. ISME J. 7, 2248–2258 (2013).

74. Edwards, J. et al. Structure, variation, and assembly of the root-associated microbiomes of rice. Proc. Natl. Acad. Sci. USA 112, E911–E920 (2015).

75. Kroj, T., Chanclud, E., Michel-Romiti, C., Grand, X. & Morel, J.-B. Integration of decoy domains derived from protein targets of pathogen effectors into plant immune receptors is widespread. New Phytol. 210, 618–626 (2016).

76. Mukhtar, M. S. et al. Independently evolved virulence effectors converge onto hubs in a plant immune system network. Science 333, 596–601 (2011).

77. Vimr, E. & Lichtensteiger, C. To sialylate, or not to sialylate: that is the question. Trends Microbiol. 10, 254–257 (2002).

78. de Jonge, R. et al. Conserved fungal LysM efector Ecp6 prevents chitin-triggered immunity in plants. Science 329, 953–955 (2010).

AcknowledgementsThe work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231. J.L.D. and S.G.T. were supported by NSF INSPIRE grant IOS-1343020, and J.L.D. was also supported by DOE–USDA Feedstock Award DE-SC001043 and by the Office of Science (BER), US Department of Energy, grant no. DE-SC0014395. S.H.P. was supported by NIH Training Grant T32 GM067553-06 and was a Howard Hughes Medical Institute (HHMI) International Student Research Fellow. D.S.L. was supported by NIH Training Grant T32 GM07092-34. J.L.D. is an Investigator of the HHMI, supported by the HHMI and the

Gordon and Betty Moore Foundation (GBMF3030). M.E.F. was supported by NIH Dr. Ruth L. Kirschstein NRSA Fellowship F32-GM112345. D.A.P. and T.-Y.L. were supported by the Genomic Science Program, US Department of Energy, Office of Science, Biological and Environmental Research as part of the Oak Ridge National Laboratory Plant Microbe Interfaces Scientific Focus Area (http://pmi.ornl.gov) and Plant Feedstock Genomics Award DE-SC001043. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. J.A.V. was supported by a SystemsX.ch grant (Micro2X) and a European Research Council (ERC) advanced grant (PhyMo). We thank I. Bertani, C. Bez, R. Bowers, D. Burstein, A. Chun Chen, D. Chiniquy, B. Cole, O. Cohen, A. Copeland, J. Eisen, E. Eloe-Fadrosh, M. Hadjithomas, O. Finkel, H. Schnitzel Meule Fux, N. Ivanova, J. Knelman, R. Malmstrom, R. Perez-Torres, D. Salomon, R. Sorek, T. Mucyn, R. Seshadri, T.K. Reddy, L. Ryan, and H. Sberro Livnat for general help, text editing, and ideas for this work. We thank R. Walcott (University of Georgia, Athens, GA, USA) for providing the Acidovorax citrulli VasD mutant strain.

Author contributionsA.L. performed most data analysis and wrote the paper. I.S.G. performed phylogenetic inference, performed phylogenetically aware analyses, analyzed the data, provided the supporting website, and contributed to manuscript writing. M. Mittelviefhaus and J.A.V. designed and performed experiments related to Hyde1 gene function and contributed to manuscript writing. S.C. isolated single bacterial cells and prepared metadata for data analysis. F.M. analyzed data. S.H.P. analyzed data and contributed to manuscript writing. J.M. produced a mutant strain for Hyde1. K.W. tested Hyde1 toxicity in E. coli. G.D. and V.V. produced deletion mutants and designed and performed rice root colonization experiments. K.S. helped in data analysis. B.R.A. prepared metadata for data analysis. D.S.L., T.-Y.L., S.L., Z.J., M. McDonald, A.P.K., M.E.F., and S.L.D. isolated bacteria from different plants or managed this process. T.G.d.R. managed the sequencing project. S.R.G., D.A.P., and R.E.L. managed bacterial isolation efforts and contributed to manuscript writing. B.Z. managed Hyde1 deletion and toxicity testing. S.G.T. contributed to manuscript writing. T.W. managed single-cell isolation efforts and contributed to manuscript writing. J.L.D. directed the overall project and contributed to manuscript writing.

Competing interestsJ.L.D. is a cofounder of and shareholder in, and S.H.P. collaborates with, AgBiome LLC, a corporation that aims to use plant-associated microbes to improve plant productivity.

Additional informationSupplementary information is available for this paper at https://doi.org/10.1038/s41588-017-0012-9.

Reprints and permissions information is available at www.nature.com/reprints.

Correspondence and requests for materials should be addressed to S.G.T. or T.W. or J.L.D.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



http://pmi.ornl.gov

https://doi.org/10.1038/s41588-017-0012-9

https://doi.org/10.1038/s41588-017-0012-9

http://www.nature.com/reprints



MethodsAdditional method descriptions appear in Supplementary Note 1.

Bacterial isolation and genome sequencing. The detailed isolation procedure is described in Supplementary Note 1. Bacterial strains from Brassicaceae and poplar were isolated via previously described protocols79,80. Poplar strains were cultured from root tissues collected from Populus deltoides and Populus trichocarpa trees in Tennessee, North Carolina, and Oregon. Root samples were processed as described previously15,80. Briefly, we isolated rhizosphere strains by plating serial dilutions of root wash, whereas for endosphere strains, we pulverized surface-sterilized roots with a sterile mortar and pestle in 10 mL of MgSO4 (10 mM) solution before plating serial dilutions. Strains were isolated on R2A agar media, and the resulting colonies were picked and re-streaked a minimum of three times to ensure isolation. Isolated strains were identified by 16S rDNA PCR followed by Sanger sequencing.

For maize isolates, we selected soils associated with Il14h and Mo17 maize genotypes grown in Lansing, NY, and Urbana, IL. The rhizosphere soil samples of each maize genotype were grown at each location and were collected at week 12 as previously described68. From each rhizosphere soil sample, soil was washed and samples were plated onto Pseudomonas isolation agar (BD Diagnostic Systems). The plates were incubated at 30 °C until colonies formed, and DNA was extracted from cells.

For isolation of single cells, A. thaliana accessions Col-0 and Cvi-0 were grown to maturity. Roots were washed in distilled water multiple times. Root surfaces were sterilized with bleach. Surface-sterilized roots were then ground with a sterile mortar and pestle. Individual cells were isolated by flow cytometry followed by DNA amplification with MDA, and 16S rDNA screening as described previously81.

DNA from isolates and single cells was sequenced on next-generation sequencing platforms, mostly using Illumina HiSeq technology (Supplementary Table 3). Sequenced genomic DNA was assembled via different assembly methods (Supplementary Table 3). Genomes were annotated using the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)23 and deposited at the IMG database (“URLs”), ENA, or Genbank for public use.

Data compilation of 3,837 isolate genomes and their isolation-site metadata. We retrieved 5,586 bacterial genomes from the IMG system (“URLs,” Supplementary Table 1). Isolation sites were identified through a manual curation process that included scanning of IMG metadata, DSMZ, ATCC, NCBI Biosample (“URLs”), and the scientific literature. On the basis of its isolation site, each genome was labeled as plant-associated, NPA, or soil-associated. Plant-associated organisms were also labeled as root-associated when isolated from the endophytic compartments or from the rhizoplane. We applied stringent quality control measures to ensure a high-quality and minimally biased set of genomes:

a. Known isolation site: genomes with missing isolation-site information were filtered out.

b. High genome quality and completeness: all isolate genomes passed this filter if N50 (the shortest sequence length at 50% of the genome) was more than 50,000 bp. Single amplified genomes passed the quality filter if they had at least 90% of 35 universal single-copy clusters of orthologous groups (COGs)82. In addition, we used CheckM83 to assess isolate genome completeness and contamination. Only genomes that were at least 95% complete and no more than 5% contaminated were used.

c. High-quality gene annotation: genomes that passed this filter had at least 90% genome sequence coding for genes, with an exception—in the Bartonella genus most genomes have coding base percentages below 90%.

d. Nonredundancy: we computed whole-genome average nucleotide identity and alignment fraction values for each pair of genomes84. When the alignment fraction exceeded 90% and the whole-genome average nucleotide identity was greater than 99.995% we considered the genome pair redundant. In such cases one genome was randomly selected, and the other genome was marked as redundant and was filtered out.

e. Consistency in the phylogenetic tree: we filtered out 14 bacterial genomes that showed discrepancy between their given taxonomy and their actual phylogenetic placement in the bacterial tree.

Construction of the bacterial genome tree. To generate a phylogenetic tree of the 3,837 high-quality and nonredundant bacterial genomes, we retrieved 31 universal single-copy genes from each genome with AMPHORA285. For each individual marker gene, we used Muscle with default parameters to construct an alignment. We masked the 31 alignments by using Zorro86 and filtered the low-quality columns of the alignment. Finally, we concatenated the 31 alignments into an overall merged alignment, from which we built an approximately maximum-likelihood phylogenetic tree with the WAG model implemented in FastTree 2.187. Trees of each taxon are provided in Dataset S5 at http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/faa_trees_metadata.html.

Clustering of 3,837 genomes into nine taxa. We divided the dataset into different taxa (taxonomic groups) to allow downstream identification of genes enriched in the plant-associated or root-associated genomes of each taxon compared with the NPA or soil-associated genomes from the same taxon, respectively. To determine

the number of taxonomic groups to analyze, we converted the phylogenetic tree into a distance matrix, using the cophenetic function implemented in the R package ape (“URLs”). We then clustered the 3,837 genomes into nine groups using k-medoids clustering as implemented in the PAM (partitioning around medoids) algorithm from the R package fpc (“URLs”). The k-medoids algorithm clusters a dataset of n objects into k a priori–defined clusters. To identify the optimal k value for the dataset, we compared the silhouette coefficients for values of k ranging from 1 to 30. We selected a value of k = 9 because it yielded the maximal average silhouette coefficient (0.66). In addition, at k = 9 the taxa were monophyletic, contained hundreds of genomes, and were relatively balanced between plant-associated and NPA genomes in most taxa (Table 1). The resulting genome clusters generally overlapped with annotated taxonomic units. One exception was in the Actinobacteria phylum. Here our clustering divided the genomes into two taxa that we named, for simplicity, Actinobacteria 1 and Actinobacteria 2. However, our rigorous phylogenetic analysis supports previous suggestions for revisions in the taxonomy of phylum Actinobacteria88.

In addition, the tree showed very divergent bacterial taxa in the Bacteroidetes phylum that could not be separated into monophyletic groups. Specifically, the Sphingobacteriales order (from class Sphingobacteria) and the Cytophagaceae (from class Cytophagia) are paraphyletic. Therefore, we decided to unify all Bacteroidetes into one phylum-level taxon. Analysis of the prevalence of the nine taxa in 16S rDNA and metagenome appears in the Supplementary Information.

Pangenome analysis. For each comparison in Supplementary Fig. 2, a random set of ten genomes from each environment (plant-associated and NPA from specific environments) was selected, and the mean and s.d. of the phylogenetic distance in the set were calculated. This step was repeated 50 times to produce two random sets of genomes (plant-associated and NPA) that were comparable and had minimum differences between their mean and s.d. of phylogenetic distances. Genes for pangenome analysis were taken from the orthogroups (see below). Core genome, accessory genome, and unique genes were defined as genes that appeared in all ten genomes, in two to nine genomes, and in only one genome, respectively. For core and accessory genomes, the median copy number in each relevant orthogroup was used.

Genome size comparison and gene category enrichment analysis. Genome sizes were retrieved from the IMG database (“URLs”) and compared by t-test and PhyloGLM26. Kernel density plots from the R sm package (“URLs”) were used to prepare Supplementary Fig. 1. Protein-coding genes were retrieved and mapped to COG IDs with the program RPS-BLAST at an e-value cutoff of 1e–2 and an alignment length of at least 70% of the consensus sequence length. Each COG ID was mapped to at least one COG category (Supplementary Table 6). For each genome, we counted the number of genes from a given category. A t-test and PhyloGLM test were used to compare the number of genes in the genomes that shared the same taxon and category but different labels (e.g., plant-associated versus NPA).

Benchmarking gene clustering with UCLUST and OrthoFinder. We computed clusters of coding sequences across each of the nine taxa defined above with two algorithms: UCLUST89 (v 7.0) and OrthoFinder24 (v 1.1.4). UCLUST was run with 50% identity and 50% coverage in the target to call the clusters. Command used: usearch7.0.1090_i86linux64 -cluster_fast < input_file > -id 0.5 -maxaccepts 0 -maxrejects 0 -target_cov 0.5 –uc < output_file > . To improve pairwise alignment performance, we used the accelerated protein alignment algorithm implemented in DIAMOND90 (v 0.8.36.98) with the --very-sensitive option in the DIAMOND BLASTP algorithm. After computing the alignments, we ran OrthoFinder with default parameters. See “URLs” for the scripts used to compute the orthogroups.

Supplementary Fig. 5 shows benchmarking of OrthoFinder against UCLUST. To estimate the quality of the clusters output by UCLUST and OrthoFinder, we mapped the proteins from our datasets to the curated set of taxon markers from Phyla-AMPHORA91. Next, we compared the distribution of each of the taxon-specific markers identified by Phyla-AMPHORA across the clusters output by UCLUST and OrthoFinder. To compare the two approaches, we estimated two metrics: the purity and the fragmentation index, explained in Supplementary Fig. 5 and in the Supplementary Information.

Identification of plant-associated, NPA, root-associated, and soil genes/domains. The following description applies to plant-associated, NPA, root-associated, and soil genes. For conciseness, only plant-associated genes are described here. Plant-associated genes were identified via a two-step process that included protein/domain clustering on the basis of amino acid sequence similarity and subsequent identification of the protein/domain clusters significantly enriched in proteins/domains from plant-associated bacteria (Supplementary Fig. 4). Clustering of genes and protein domains involved five independent methods: OrthoFinder24, COG20, KEGG Orthology (KO)21, TIGRFAM22, and Pfam25. OrthoFinder was selected (after the aforementioned benchmarking) as a clustering approach that included all proteins, including those that lack any functional annotation. We first compiled, for each taxon separately, a list of all proteins in the genomes. For COG, KO, TIGRFAM, and Pfam, we used the existing annotations



http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/faa_trees_metadata.html

http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/faa_trees_metadata.html



of IMG genes that are based on BLAST alignments to the different protein/domain models23. This process yielded gene/domain clusters. Next, we determined which clusters were significantly enriched with genes derived from plant-associated genomes. These clusters were termed plant-associated clusters. In the statistical analysis, we used only clusters of more than five members. We corrected P values with Benjamini–Hochberg FDR and used q < 0.05 as the significance threshold, unless stated otherwise. The proteins in each cluster were categorized as either plant-associated or NPA, on the basis of the label of the encoding genome. Namely, a plant-associated gene is a gene derived from a plant-associated gene cluster and a plant-associated genome.

The three main approaches were the hypergeometric test (Hyperg), PhyloGLM, and Scoary. Hyperg looks for overall enrichment of gene copies across a group of genomes but ignores the phylogenetic structure of the dataset. PhyloGLM26 takes into account phylogenetic information to eliminate apparent enrichments that can be explained by shared ancestry. The Hyperg and PhyloGLM tests were used in two versions, based on either gene presence/absence data (hypergbin, phyloglmbin) or gene copy-number data (hypergcn, phyloglmcn). We also used a stringent version of Scoary27, a gene presence/absence approach that combines Fisher’s exact test, a phylogenetic test, and a label-permutation test. The first hypergeometric test, hypergcn, used the gene copy-number data, with the cluster being the sample, the total number of plant-associated and NPA genes being the population, and the number of plant-associated genes within the cluster being considered as ‘successes’. The second version, hybergbin, used gene presence/absence data. P values were corrected by Benjamini–Hochberg FDR92 for clusters of COG/KO/TIGRFAM/Pfam. For the abundant OrthoFinder clusters, we used Bonferroni correction with a threshold of P < 0.1, as downstream validation with metagenomes showed fewer false positives with the more significant clusters. The third and fourth statistical approaches used PhyloGLM26, implemented in the phylolm (v 2.5) R package (“URLs”). PhyloGLM combines a Markov process of lifestyle (e.g., plant-associated versus NPA) evolution with a regularized logistic regression. This approach takes advantage of the known phylogeny to specify the residual correlation structure between genomes that share common ancestry, and so it does not need to make the incorrect assumption that observations are independent. Intuitively PhyloGLM favors genes found in multiple lineages of the same taxon. For each taxon we used the subtree from Fig. 1a to estimate the correlation matrix between observations and used the copy number (in phyloglmcn) or presence/absence pattern (in phyloglmbin) of each gene as the only independent variable. Positive and negative estimates in phyloglmbin/phyloglmcn indicated plant-associated/root-associated and NPA/soil-associated proteins/domains, respectively.

Finally, the fifth statistical approach was Scoary27, which uses a gene presence/absence dataset. Scoary combines Fisher’s exact test, a phylogeny-aware test, and an empirical label-switching permutation analysis. A gene cluster was considered significant by Scoary only if (1) it had a q-value less than 0.05 for Fisher’s exact test, (2) the ‘worst’ P value from the pairwise comparison algorithm was < 0.05, and (3) the empirical (permutation-based) P value was < 0.05. These are very stringent criteria that yielded relatively few significant predictions. Odds ratios greater than or less than 1 in Scoary indicated plant-associated/root-associated and NPA/soil-associated proteins/domains, respectively.

See “URLs” for links to the code used for the gene enrichment tests. A description of additional assessment of plant-associated/NPA prediction robustness using validation genome datasets is presented in Supplementary Note 1.

Validation of predicted plant-associated, NPA, root-associated, and soil-associated genes using metagenomes. Metagenome samples (n = 38; Supplementary Table 16) were downloaded from NCBI and GOLD (“URLs”). The reads were translated into proteins, and proteins at least 40 amino acids long were aligned using HMMsearch93 against the different protein references. The protein references included the predicted plant-associated, root-associated, soil-associated, and NPA proteins from OrthoFinder that were found to be significant by the different approaches. The normalization process is explained in Supplementary Figs. 12–16.

Principal coordinates analysis. To visualize the overall contribution of statistically significant enriched/depleted orthogroups to the differentiation of plant-associated and NPA genomes, we used principal coordinates analysis (PCoA) and logistic regression. For each of the nine taxa analyzed, we ran this analysis over a collection of matrices. The first matrix was the full pan-genome matrix, which depicted the distribution of all the orthogroups contained across all the genomes in a given taxon. The subsequent matrices represented subsets of the full pan-genome matrix; each of these matrices depicted the distribution of only the statistically significant orthogroups as called by one of the five different algorithms used to test for the genotype–phenotype association. A full description of this process is presented in Supplementary Note 1.

We used the function cmdscale from the R (v 3.3.1) stats package to run PCoA over all the matrices described above, using the Canberra distance as implemented in the vegdist function from the vegan (v 2.4-2) R package (“URLs”). Then, we took the first two axes output from the PCoA and used them as independent variables to fit a logistic regression over the labels of each genome (plant-associated, NPA). Finally, we computed the Akaike information criterion for each of the different

models fitted. Briefly, the Akaike information criterion estimates how much information is lost when a model is applied to represent the true model of a particular dataset. See “URLs” for a link to the scripts used to perform the PCoA.

Validation of plant-associated genes in Paraburkholderia kururiensis M130 affecting rice root colonization. Growth and transformation details of P. kururiensis M130 are described in Supplementary Note 1.

Mutant construction. Internal fragments of 200–900 bp from each gene of interest were PCR-amplified with the primers listed in Supplementary Table 17c. Fragments were cloned in the pGem2T easy vector (Promega) and sequenced (GATC Biotech; Germany), then excised with EcoRI restriction enzyme and cloned in the corresponding site in pKNOCK-Km R94. These plasmids were then used as a suicide delivery system to create the knockout mutants and transferred to P. kururiensis M130 by triparental mating. All the mutants were verified by PCR with primers specific to the pKNOCK-Km vector and to the genomic DNA sequences upstream and downstream from the targeted genes.

Rhizosphere colonization experiments with P. kururiensis and mutant derivatives. Seeds of Oryza sativa (BALDO variety) were surface-sterilized and then left to germinate in sterile conditions at 30 °C in the dark for 7 d. Each seedling was then aseptically transferred into a 50-mL Falcon tube containing 35 mL of half-strength Hoagland solution semisolid substrate (0.4% agar). The tubes were then inoculated with 107 colony-forming units (cfu) of a P. kururiensis suspension. Plants were grown for 11 d at 30 °C (16/8-h light/dark cycles). For the determination of the bacterial counts, plants were washed under tap water for 1 min and then cut below the cotyledon to excise the roots. Roots were air-dried for 15 min, weighed, and then transferred to a sterile tube containing 5 mL of PBS. After vortexing, the suspension was serially diluted to 10−1 and 10−2 in PBS, and aliquots were plated on KB plates containing the appropriate antibiotic (rifampicin 50 µ g/mL for the wild type, rifampicin 50 µ g/mL and kanamycin 50 µ g/mL for the mutants). After 3 d of incubation at 30 °C, we counted colony-forming units (CFU). Three replicates for each dilution from ten independent plantlets were used to determine the average CFU values.

Plant-mimicking plant-associated and root-associated proteins. Supplementary Fig. 21 summarizes the algorithm used to find plant-mimicking plant-associated and root-associated proteins. Pfam25 version 30.0 metadata were downloaded. Protein domains that appeared in both Viridiplantae and bacteria and occurred at least two times more frequently in Viridiplantae than in bacteria were considered as plant-like domains (n = 708). In parallel, we scanned the set of significant plant-associated, root-associated, NPA, and soil-associated Pfam protein domains predicted by the five algorithms in the nine taxa. We compiled a list of domains that were significantly plant-associated/root-associated in at least four tests, and significantly NPA/soil-associated in up to two tests (n = 1,779). The overlapping domains between the first two sets were defined as PREPARADOs (n = 64). In parallel, we created two control sets of 500 random plant-like Pfam domains and 500 random plant-associated/root-associated Pfam domains. Enrichment of PREPARADOs integrated into plant NLR proteins in comparison to the domains in the control groups was tested by Fisher’s exact test. To identify domains found in plant disease-resistance proteins, we retrieved all proteins from Phytozome and BrassicaDB (“URLs”). To identify domains in plant disease-resistance proteins, we used hmmscan to search protein sequences for the presence of NB-ARC (PF00931.20), TIR (PF01582.18), TIR_2 (PF13676.4), or RPW8 (PF05659.9) domains. Bacterial proteins carrying the PREPARADO domains were considered as having full-length identity to fungal, oomycete, or plant proteins on the basis of LAST alignments to all Refseq proteins of plants, fungi, and protozoa. “Full-length” is defined here as an alignment length of at least 90% of the length of both query and reference proteins. The threshold used for considering a high amino acid identity was 40%. An explanation of the prediction of secretion of proteins with PREPARADOs is presented in the Supplementary Information.

Prediction of plant-associated, NPA, root-associated, and soil-associated operons and their annotation as biosynthetic gene clusters. Significant plant-associated, NPA, root-associated, and soil-associated genes of each genome were clustered on the basis of genomic distance: genes sharing the same scaffold and strand that were up to 200 bp apart were clustered into the same predicted operon. We allowed up to one spacer gene, which is a non-significant gene, between each pair of significant genes within an operon. Operons were predicted for the genes in COG and OrthoFinder clusters using all five approaches. Operons were annotated as biosynthetic gene clusters if at least one of the constituent genes was part of a biosynthetic gene cluster from the IMG-ABC database95.

Jekyll and Hyde analyses. To find all homologs and paralogs of Jekyll and Hyde genes, we used IMG BLAST search with an e-value threshold of 1e–5 against all IMG isolates. We searched Hyde1 homologs of Acidovorax, Hyde1 homologs of Pseudomonas, Hyde2, and Jekyll genes using proteins of genes Aave_1071, A243_06583, Ga0078621_123530, and Ga0102403_10160 as the query sequence, respectively. Multiple sequence alignments were done with Mafft96. A phylogenetic





tree of Acidovorax species was produced with RaxML97, based on concatenation of 35 single-copy genes98.

Hyde1 toxicity assay. To verify the toxicity of Hyde1 and Hyde2 proteins to E. coli, we cloned genes encoding proteins Aave_0990 (Hyde2), Aave_0989 (Hyde1), and Aave_3191 (Hyde1), or GFP as a control, to the inducible pET28b expression vector via the LR reaction. The recombinant vectors were transformed into E. coli C41 competent cells by electroporation after sequencing validation. Five colonies were selected and cultured in LB liquid media supplemented with kanamycin with shaking overnight. The OD600 of the bacterial culture was adjusted to 1.0, and then the culture was diluted by 102, 104, 106, and 108 times successively. Bacteria culture gradients were spotted (5 μ L) on LB plates with or without 0.5 mM IPTG to induce gene expression.

Construction of ∆5-Hyde1 strain. Details of the construction of the ∆ 5-Hyde1 strain are presented in Supplementary Note 1. A. citrulli strain AAC00-1 and its derived mutants were grown on nutrient agar medium supplemented with rifampicin (100 µ g/ml). To delete a cluster of five Hyde1 genes (Aave_3191–3195), we carried out a marker-exchange mutagenesis as previously described99. The marker-free mutant was designated as ∆ 1-Hyde1, and its genotype was confirmed by PCR amplification and sequencing. The marker-exchange mutagenesis procedure was repeated to delete four other Hyde1 loci (Supplementary Fig. 28). The primers used are listed in Supplementary Table 25. The final mutant, with deletion of 9 out of 11 Hyde1 genes (in five loci), was designated as ∆ 5-Hyde1 and was used for competition assay. The ∆ T6SS mutant was provided by Ron Walcott’s lab.

Competition assay of Acidovorax citrulli AAC00-1 against different strains. Bacterial strains. E. coli BW25113 pSEVA381 was grown aerobically in LB broth (5 g/L NaCl) at 37 °C in the presence of chloramphenicol. Naturally antibiotic-resistant bacterial leaf isolates16 and Acidovorax strains were grown aerobically in NB medium (5 g/L NaCl) at 28 °C in the presence of the appropriate antibiotic. Antibiotic resistance and concentrations used in the competition assay are mentioned in Supplementary Table 25.

Competition assay. Competition assays were conducted similarly as described elsewhere66,100. Briefly, bacterial overnight cultures were harvested and washed in PBS (pH 7.4) to remove excess antibiotics, and resuspended in fresh NB medium to an optical density of 10. Predator and prey strains were mixed at a 1:1 ratio, and 5 µ L of the mixture was spotted onto dry NB agar plates and incubated at 28 °C. As a negative control, the same volume of NB medium was mixed with prey cells instead of the predator strain. After 19 h of coincubation, bacterial spots were excised from the agar and resuspended in 500 µ L of NB medium and then spotted on NB agar containing antibiotic selective for the prey strains. CFUs of recovered prey cells were determined after incubation at 28 °C. All assays were performed in at least three biological replicates.

Life sciences reporting summary. Further information on experimental design is available in the Life Sciences Reporting Summary.

Data availability. All new genomes (Supplementary Table 3) were submitted and are publicly available in at least one of the following databanks (see accessions in Supplementary Table 3):

1. IMG/M, https://img.jgi.doe.gov/cgi-bin/m/main.cgi2. Genbank, https://www.ncbi.nlm.nih.gov/genbank/3. ENA, http://www.ebi.ac.uk/ena4. A dedicated website for the Dangl lab: http://labs.bio.unc.edu/Dangl/

Resources/gfobap_website/index.htmlThe dedicated website contains nucleotide and amino acid FASTA files of all

datasets used, protein/domain annotations (COG, KO, TiGRfam, Pfam), metadata, phylogenetic trees, OrthoFinder orthogroups, orthogroup hidden Markov models, full enrichment datasets, correlation between orthogroups, and predicted operons (“URLs”).

Links to different scripts that were used in analysis are included in the “URLs” section. The full genome sequence, gene annotation, and metadata of each genome

used can be found at the IMG website (https://img.jgi.doe.gov/). For example, the metadata of taxon ID 2558860101 can be found at https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section= TaxonDetail&page= taxonDetail&taxon_oid= 2558860101.

References 79. Doty, S. L. et al. Diazotrophic endophytes of native black cottonwood and

willow. Symbiosis 47, 23–33 (2009). 80. Weston, D. J. et al. Pseudomonas fluorescens induces strain-dependent and

strain-independent host plant responses in defense networks, primary metabolism, photosynthesis, and fitness. Mol. Plant Microbe Interact. 25, 765–778 (2012).

81. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).

82. Beszteri, B., Temperton, B., Frickenhaus, S. & Giovannoni, S. J. Average genome size: a potential source of bias in comparative metagenomics. ISME J. 4, 1075–1077 (2010).

83. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).

84. Varghese, N. J. et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43, 6761–6771 (2015).

85. Kerepesi, C., Bánky, D. & Grolmusz, V. AmphoraNet: the webserver implementation of the AMPHORA2 metagenomic workflow suite. Gene 533, 538–540 (2014).

86. Wu, M., Chatterji, S. & Eisen, J. A. Accounting for alignment uncertainty in phylogenomics. PLoS One 7, e30288 (2012).

87. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).

88. Sen, A. et al. Phylogeny of the class Actinobacteria revisited in the light of complete genomes. The orders ‘Frankiales’ and Micrococcales should be split into coherent entities: proposal of Frankiales ord. nov., Geodermatophilales ord. nov., Acidothermales ord. nov. and Nakamurellales ord. nov. Int. J. Syst. Evol. Microbiol. 64, 3821–3832 (2014).

89. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).

90. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

91. Wang, Z. & Wu, M. A phylum-level bacterial phylogenetic marker database. Mol. Biol. Evol. 30, 1258–1262 (2013).

92. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).

93. Finn, R. D. et al. HMMER web server: 2015 update. Nucleic Acids Res. 43, W30–W38 (2015).

94. Alexeyev, M. F. The pKNOCK series of broad-host-range mobilizable suicide vectors for gene knockout and targeted DNA insertion into the chromosome of gram-negative bacteria. Biotechniques 26, 824–826 (1999).

95. Hadjithomas, M. et al. IMG-ABC: a knowledge base to fuel discovery of biosynthetic gene clusters and novel secondary metabolites. MBio 6, e00932 (2015).

96. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).

97. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57, 758–771 (2008).

98. Finkel, O. M., Béjà, O. & Belkin, S. Global abundance of microbial rhodopsins. ISME J. 7, 448–451 (2013).

99. Traore, S. M. Characterization of Type Three Effector Genes of A. citrulli, the Causal Agent of Bacterial Fruit Blotch of Cucurbits. (Virginia Polytechnic Institute and State University, Blacksburg, VA, 2014).

100. Basler, M., Ho, B. T. & Mekalanos, J. J. Tit-for-tat: type VI secretion system counterattack during bacterial cell-cell interactions. Cell 152, 884–894 (2013).



https://img.jgi.doe.gov/cgi-bin/m/main.cgi

https://www.ncbi.nlm.nih.gov/genbank/

http://www.ebi.ac.uk/ena



https://img.jgi.doe.gov/

https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2558860101

https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2558860101


1

nature research | life sciences reporting summ

aryJune 2017

Corresponding author(s): Jeffery L. Dangl

Initial submission Revised version Final submission

Life Sciences Reporting SummaryNature Research wishes to improve the reproducibility of the work that we publish. This form is intended for publication with all accepted life science papers and provides structure for consistency and transparency in reporting. Every life science submission will use this form; some list items might not apply to an individual manuscript, but all fields must be completed for clarity.

For further information on the points included in this form, see Reporting Life Sciences Research. For further information on Nature Research policies, including our data availability policy, see Authors & Referees and the Editorial Policy Checklist.

Experimental design1. Sample size

Describe how sample size was determined. Computational analysis is done based on sample size in the order of hunderds. Experiments were done with sample size of 3-30 biological replicates.

2. Data exclusions

Describe any data exclusions. No data was excluded.

3. Replication

Describe whether the experimental findings were reliably reproduced.

All reported results in the paper was reliably reproduced in biological replicates.

4. Randomization

Describe how samples/organisms/participants were allocated into experimental groups.

Mostly irrelevant to the study. In the few cases where we used randomizations we made sure the randomized control group has similar features to the tested group. For example in the randomization done for PREPARADOs enrichment in planr NLR immune proteins we randomly sampled 500 PA/RA domains and 500 plant protein domains.

5. Blinding

Describe whether the investigators were blinded to group allocation during data collection and/or analysis.

Irrelevant to the study.

Note: all studies involving animals and/or human research participants must disclose whether blinding and randomization were used.

6. Statistical parameters For all figures and tables that use statistical methods, confirm that the following items are present in relevant figure legends (or in the Methods section if additional space is needed).

n/a Confirmed

The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement (animals, litters, cultures, etc.)

A description of how samples were collected, noting whether measurements were taken from distinct samples or whether the same sample was measured repeatedly

A statement indicating how many times each experiment was replicated

The statistical test(s) used and whether they are one- or two-sided (note: only common tests should be described solely by name; more complex techniques should be described in the Methods section)

A description of any assumptions or corrections, such as an adjustment for multiple comparisons

The test results (e.g. P values) given as exact values whenever possible and with confidence intervals noted

A clear description of statistics including central tendency (e.g. median, mean) and variation (e.g. standard deviation, interquartile range)

Clearly defined error bars

See the web collection on statistics for biologists for further resources and guidance.

2

nature research | life sciences reporting summ

aryJune 2017

SoftwarePolicy information about availability of computer code

7. Software

Describe the software used to analyze the data in this study.

All software used is described in methods and provided to readers. We mention different R packages used in analysis.

For manuscripts utilizing custom algorithms or software that are central to the paper but not yet described in the published literature, software must be made available to editors and reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). Nature Methods guidance for providing algorithms and software for publication provides further information on this topic.

Materials and reagentsPolicy information about availability of materials

8. Materials availability

Indicate whether there are restrictions on availability of unique materials or if these materials are only available for distribution by a for-profit company.

There are no restrictions on materials availability.

9. Antibodies

Describe the antibodies used and how they were validated for use in the system under study (i.e. assay and species).

No antibodies were used.

10. Eukaryotic cell linesa. State the source of each eukaryotic cell line used. Irrelevant to the study

b. Describe the method of cell line authentication used. Irrelevant to the study

c. Report whether the cell lines were tested for mycoplasma contamination.

Irrelevant to the study

d. If any of the cell lines used are listed in the database of commonly misidentified cell lines maintained by ICLAC, provide a scientific rationale for their use.


Animals and human research participantsPolicy information about studies involving animals; when reporting animal research, follow the ARRIVE guidelines

11. Description of research animalsProvide details on animals and/or animal-derived materials used in the study.


Policy information about studies involving human research participants

12. Description of human research participantsDescribe the covariate-relevant population characteristics of the human research participants.


Documents

Genomic features of bacterial adaptation to plantslabs.bio.unc.edu/Dangl/Resources/gfobap_website/s41588... · 2017-12-18 · We also identified 64 plant-associated pro-tein domains