57
1 Short Title: Evolution of the plant HRGP superfamily 1 2 Corresponding Author 3 Carolyn J. Schultz, School of Agriculture, Food and Wine, The University of Adelaide, Waite 4 Research Institute, PMB1 Glen Osmond, SA, 5064, Australia. Ph: 61 8 448 909 881. Email address: 5 [email protected] 6 7 Insights into the evolution of hydroxyproline rich glycoproteins from 1000 plant transcriptomes 8 9 Kim L. Johnson 1* , Andrew M. Cassin 1* , Andrew Lonsdale 1 , Gane Ka-Shu Wong 2 , Douglas E. Soltis 3 , 10 Nicholas W. Miles 3 , Michael Melkonian 4 , Barbara Surek 4 , Michael K. Deyholos 5 , James Leebens- 11 Mack 6 , Carl J. Rothfels 7 , Dennis W. Stevenson 8 , Sean W. Graham 9 , Xumin Wang 10 , Shuangxiu Wu 10 , 12 J. Chris Pires 11 , Patrick P. Edger 12 , Eric J. Carpenter 2 , Antony Bacic 1 , Monika S. Doblin 1^ and Carolyn 13 J. Schultz 13^ 14 1 ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, 15 Parkville, Vic, 3010, Australia 16 2 Departments of Biological Sciences and Medicine, University of Alberta, Edmonton, Alberta, 17 Canada, and BGI-Shenzhen, Bei Shan Industrial Zone, Yantian District, Shenzhen, China 18 3 Florida Museum of Natural History, Department of Biology, University of Florida, Gainsville, 19 FL32611, USA 20 4 Botanical Institute, Cologne Biocenter, University of Cologne, D50674, Germany 21 5 Department of Biology, University of British Columbia, Kelowna BC, V1V 1V7, Canada 22 6 Department of Plant Biology, University of Georgie, Athens, GA3062, USA 23 Plant Physiology Preview. Published on April 26, 2017, as DOI:10.1104/pp.17.00295 Copyright 2017 by the American Society of Plant Biologists www.plantphysiol.org on March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

Insights into the evolution of hydroxyproline rich …...Insights into the evolution of hydroxyproline rich glycoproteins ... ... 4)

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

  • 1

    Short Title: Evolution of the plant HRGP superfamily 1

    2

    Corresponding Author 3

    Carolyn J. Schultz, School of Agriculture, Food and Wine, The University of Adelaide, Waite 4

    Research Institute, PMB1 Glen Osmond, SA, 5064, Australia. Ph: 61 8 448 909 881. Email address: 5

    [email protected] 6

    7

    Insights into the evolution of hydroxyproline rich glycoproteins from 1000 plant transcriptomes 8

    9

    Kim L. Johnson1*, Andrew M. Cassin1*, Andrew Lonsdale1, Gane Ka-Shu Wong2, Douglas E. Soltis3, 10

    Nicholas W. Miles3, Michael Melkonian4, Barbara Surek4, Michael K. Deyholos5, James Leebens-11

    Mack6, Carl J. Rothfels7, Dennis W. Stevenson8, Sean W. Graham9, Xumin Wang10, Shuangxiu Wu10, 12

    J. Chris Pires11, Patrick P. Edger12, Eric J. Carpenter2, Antony Bacic1, Monika S. Doblin1^ and Carolyn 13

    J. Schultz13^ 14

    1 ARC Centre of Excellence in Plant Cell Walls, School of BioSciences, The University of Melbourne, 15

    Parkville, Vic, 3010, Australia 16

    2 Departments of Biological Sciences and Medicine, University of Alberta, Edmonton, Alberta, 17

    Canada, and BGI-Shenzhen, Bei Shan Industrial Zone, Yantian District, Shenzhen, China 18

    3 Florida Museum of Natural History, Department of Biology, University of Florida, Gainsville, 19

    FL32611, USA 20

    4 Botanical Institute, Cologne Biocenter, University of Cologne, D50674, Germany 21

    5 Department of Biology, University of British Columbia, Kelowna BC, V1V 1V7, Canada 22

    6 Department of Plant Biology, University of Georgie, Athens, GA3062, USA 23

    Plant Physiology Preview. Published on April 26, 2017, as DOI:10.1104/pp.17.00295

    Copyright 2017 by the American Society of Plant Biologists

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 2

    7 University Herbarium and Department of Integrative Biology, University of California, Berkeley, 24

    CA 94720, USA 25

    8 New York Botanical Garden, 2900 Southern Blvd, Bronx, NY 10458, USA 26

    9 Department of Botany, University of British Columbia, Vancouver BC, V6T1Z4, Canada 27

    10 CAS Key Laboratory of Genome Science and Information, Beijing Institute of Genomics, Chinese 28

    Academy of Sciences, Beijing 100101, China 29

    11 Division of Biological Sciences and Bond Life Sciences Center, University of Missouri, Columbia, 30

    MO 65211, USA 31

    12 Department of Horticulture, Michigan State University, East Lansing, MI 48823, USA 32

    13 School of Agriculture, Food and Wine, The University of Adelaide, Waite Research Institute, PMB1 33

    Glen Osmond, SA, 5064, Australia 34

    * co-first contributors, ^ co-senior authors. 35

    36

    One-sentence summary 37

    Evaluation of hydroxyproline-rich glycoproteins (HRGPs) identified in 1KP transcriptomic datasets to 38

    gain insight into their evolutionary history from chlorophyte algae to flowering plants. 39

    40

    List of Author Contributions 41

    D.E.S., N.W.M., M.M., B.S., M.K.D., J.L.M., C.J.R., D.W.S., S.W.G., J.C.P., P.P.E., X.W. and S.W. 42

    selected species for analysis and collected plant and algal samples and M.M., B.S., M.K.D., J.C.P., 43

    C.J.R., D.W.S., S.W., extracted RNA for sequencing. J.L.M. organized transcripts into gene families 44

    and S.W. screened RNA-seq data for contamination. E.J.C. and G.K.S.W. generated the sequence 45

    data, E.J.C. setup and maintained 1KP web-resources used to communicate data and G.K.S.W. 46

    oversaw the 1KP project. A.M.C. generated the HRGP website and K.L.J., A.M.C., M.S.D., C.J.S. 47

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 3

    and A.B. designed and oversaw the research. A.L. provided bioinformatics support and assisted in 48

    maintenance of the HRGP website. K.J.L., A.M.C., M.S.D. and C.J.S. analyzed the data and together 49

    with A.B. wrote the manuscript. 50

    51

    Funding Information 52

    The authors wish to acknowledge the 1000 plants (1KP) initiative, which is led by GKSW and funded 53

    by the Alberta Ministry of Enterprise and Advanced Education, Alberta Innovates Technology Futures 54

    (AITF), Innovates Centre of Research Excellence (iCORE), and Musea Ventures. KLJ, AMC, AL, 55

    MSD & AB acknowledge the support of a grant from the Australia Research Council to the ARC 56

    Centre of Excellence in Plant Cell Walls (CE1101007) that is partly funded by the Australian 57

    Government’s National Collaborative Research Infrastructure Program (NCRIS) administered through 58

    Bioplatforms Australia (BPA) Ltd. This research was supported by a Victorian Life Sciences 59

    Computation Initiative (VLSCI) grant number VR0191 on its Peak Computing Facility at the 60

    University of Melbourne, an initiative of the Victorian Government, Australia. 61

    62

    Corresponding Author 63

    Email address: [email protected] 64

    65

    Keywords 66

    Hydroxyproline-rich glycoproteins (HRGPs), arabinogalactan-proteins (AGPs), extensins (EXTs), 67

    proline-rich proteins (PRPs), transcriptomics, plant evolution, 1000 plants (1KP), 68

    glycosylphosphatidylinositol (GPI)-anchors, intrinsically disordered proteins (IDP) 69

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 4

    Abstract 70

    The carbohydrate-rich cell walls of land plants and algae have been the focus of much interest given 71

    the value of cell wall based products to our current and future economies. Hydroxyproline-rich 72

    glycoproteins (HRGPs), a major group of wall glycoproteins, play important roles in plant growth and 73

    development, yet little is known about how they have evolved in parallel with the polysaccharide 74

    components of walls. We investigate the origins and evolution of the HRGP superfamily, which is 75

    commonly divided into three major multigene families: the arabinogalactan-proteins (AGPs), 76

    extensins (EXTs) and proline-rich proteins (PRPs). 77

    Using MAAB, a newly developed bioinformatics pipeline, we identified HRGPs in sequences from 78

    the 1000 plants (1KP) transcriptome project (www.onekp.com). Our analyses provide new insights 79

    into the evolution of HRGPs across major evolutionary milestones, including the transition to land and 80

    early radiation of angiosperms. Significantly, data mining reveals the origin of GPI-anchored AGPs in 81

    green algae and a 3-4 fold increase in GPI-AGPs in liverworts and mosses. The first detection of 82

    cross-linking (CL)-EXTs is observed in bryophytes, which suggest that CL-EXTs arose though the 83

    juxtaposition of pre-existing SPn EXT glycomotifs with refined Y-based motifs. We also detected the 84

    loss of CL-EXT in a few lineages, including the grass family (Poaceae), that have a cell wall 85

    composition distinct from other monocots and eudicots. A key challenge in HRGP research is tracking 86

    individual HRGPs throughout evolution. Using the 1KP output we were able to find putative orthologs 87

    of Arabidopsis pollen-specific GPI-AGPs in basal eudicots. 88

    89

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 5

    Introduction 90

    Cell walls of plants and algae are widely used for food, textiles, paper and timber, yet our 91

    understanding of their assembly and dynamic re-modelling in response to growth, development and 92

    environmental stresses (abiotic and biotic) remains rudimentary (Doblin et al., 2010; Doblin et al., 93

    2014). There are two contrasting types of walls/extracellular matrices in the green plant lineage, 94

    protein-rich walls of some green algae and the cellulose-rich walls of embryophytes (land plants). 95

    Recent studies of cell wall evolution suggest that the origins of many wall components occurred in the 96

    streptophyte green algae prior to the evolution of embryophytes (Sørensen et al., 2010; Popper et al., 97

    2011; Domozych et al., 2012) with elaboration of a pre-existing set of polysaccharides rather than an 98

    entirely new polymer framework. Although both the ultrastructure of plant walls and the fine structure 99

    of their component polymers varies widely, they all typically comprise a fibrillar network of cellulose 100

    microfibrils that is embedded within a gel-like matrix of non-cellulosic polysaccharides, pectins and 101

    (glyco)proteins; the latter being both structural and enzymatic (Somerville et al., 2004; Doblin et al., 102

    2010). 103

    104

    Research to date has largely focused on the evolution of the polysaccharides, cellulose and the non-105

    cellulosic polysaccharides (hemicelluloses) and pectins, primarily through the availability of 106

    polysaccharide epitope-specific antibodies coupled with immuno-fluorescence and/or transmission 107

    electron microscopy and arraying techniques such as comprehensive microarray polymer profiling, 108

    (CoMPP), and plate-based arrays (Moller et al., 2007; Pattathil et al., 2010; Moore et al., 2014). In 109

    contrast, relatively little is known about the origin and evolution of the hydroxyproline-rich 110

    glycoproteins (HRGPs), a major group of wall glycoproteins, despite their widespread occurrence 111

    (Supplemental Table S1) and importance in plant growth and development (reviewed in Fincher et 112

    al., 1983; Kieliszewski and Lamport, 1994; Majewska-Sawka and Nothnagel, 2000; Ellis et al., 2010; 113

    Lamport et al., 2011; Draeger et al., 2015; Velasquez et al., 2015; Showalter and Basu, 2016). 114

    Examination of these glycoproteins in a wider range of plant species should provide valuable insights 115

    into how they have evolved in parallel with the polysaccharide components in plant walls. 116

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 6

    117

    HRGPs are commonly divided into three major multigene families, the highly glycosylated 118

    arabinogalactan-proteins (AGPs), the moderately glycosylated extensins (EXTs) and minimally 119

    glycosylated proline-rich proteins (PRPs). The protein backbones of these HRGPs are distinguished 120

    by differences in amino acid bias and the repeat motifs characteristic to each family, and include the 121

    post-translational modification of proline (Pro, P) to hydroxyproline (Hyp, O) (see Johnson et al., 122

    companion paper). Each of these families is further elaborated by chimeric and hybrid HRGPs, 123

    complicating not only the classification of members but also elucidation of their functions. 124

    125

    Classical AGPs consist of up to 99% carbohydrate due to the addition of large type II arabino-3,6-126

    galactan (AG) polysaccharides (degree of polymerization, DP, 30-120) on to the minor (1-10%) 127

    protein backbone. The complexity of this family is immense and lies both in the diversity of protein 128

    backbones containing AGP glycomotifs and the incredible heterogeneity of glycan structures and 129

    sugars decorating these proteins. The specific binding of AGPs to the dye β-glucosyl Yariv reagent 130

    (β-Glc Yariv) and the generation of AGP-glycan specific antibodies have provided valuable insight 131

    into their function, structure and evolution (Supplemental Table S1) (Jermyn and Yeow, 1975; van 132

    Holst and Clarke, 1985; Kitazawa et al., 2013; Paulsen et al., 2014). These studies suggest AGPs are 133

    evolutionarily ancient. O-glycosylation occurs in bryophytes, some chlorophytes and brown algae as 134

    shown by binding of β-Glc Yariv and/or isolation of Hyp-containing glycoproteins (Godl et al., 1997; 135

    Ligrone et al., 2002; Lee et al., 2005; Shibaya and Sugawara, 2009; Herve et al., 2016) (Johnson et al., 136

    companion paper). Classical AGPs are likely to exist in the chlorophyte algae as one of the first-137

    cloned HRGP genes (NCBI AAB23258.2) from Chlamydomonas reinhardtii, CrZSP1 (Woessner and 138

    Goodenough, 1989), is predicted to be a GPI-AGP according to our criteria (Johnson et al., 139

    companion paper). It is likely that AGPs in the chlorophyte algae have distinct functions from 140

    streptophytes, as no large AG polysaccharides have been reported (Ferris et al., 2001). Short, 141

    branched (Ferris et al., 2001) and linear Ara- and galactose (Gal)-containing oligosaccharides (Bollig 142

    et al., 2007) have been isolated from C. reinhardtii walls. 143

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 7

    144

    The highly glycosylated AGPs are soluble and have been proposed to act as mobile signals, 145

    potentially involved in cell-cell communication (Schultz et al., 1998; Motose et al., 2004; Pereira et 146

    al., 2014). Linking of AGPs to the plasma membrane by a GPI-anchor is significant, as it points 147

    towards their increased organization within the cell wall-plasma membrane interface. The finding of 148

    GPI-AGPs in lipid ‘rafts/nano-domains’ in eudicots is intriguing as these have been proposed to 149

    stabilize and increase the activity of proteins and protein complexes at the plasma membrane involved 150

    in cell-cell communication, signal transduction, immune responses, and transport (Borner et al., 2005; 151

    Grennan, 2007; Tapken and Murphy, 2015). It is tempting to speculate on a role for GPI-AGPs in 152

    similar processes in all species in which they occur, although experimental evidence is largely 153

    lacking. 154

    155

    AGPs have been most studied in embryophytes and have been isolated from a wide range of tissues, 156

    including seeds, roots, stems, leaves, inflorescences, and are particularly abundant in the stems of 157

    gymnosperms and angiosperms (Gaspar et al., 2001). The co-occurrence of specific glycan epitopes 158

    with developmental stages, the expression pattern of AGP genes, and mutant studies has implicated 159

    AGPs in many processes. These include roles in plant growth and development, cell expansion and 160

    division, hormone signaling, embryogenesis of somatic cells, differentiation of xylem and responses 161

    to abiotic stress. The heterogeneity of the AGP family and vast number of family members suggests 162

    they are multifunctional, similar to what is found in mammalian proteoglycan/glycoproteins and 163

    intrinsically disordered proteins (IDPs) (Filmus et al., 2008; Schaefer and Schaefer, 2010; Tan et al., 164

    2012). Given that diverse members could have the same glycosylation and conversely that the same 165

    backbone could have alternate glycosylated forms, determining the function of any single AGP and 166

    the establishment of precise AGP function is a major challenge. Despite the magnitude of this task, 167

    enormous progress has been made. For example, two well characterized AGPs are the pollen-specific 168

    AtAGP6 and AtAGP11 that play important roles in pollen development and female tissue 169

    interactions, and are suggested to be involved in multiple, integrated signaling pathways (Nguema-170

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 8

    Ona et al., 2012). Investigation of when these AGPs evolved would aid us to ascertain their specific 171

    functions in pollen development. 172

    173

    The EXTs are an equally diverse family of HRGPs whose members contain repetitive SO3-5 motifs 174

    that direct moderate glycosylation with Ara-oligosaccharides (arabinosides) (DP of 1 to 4). Classical 175

    EXTs play important structural roles through cross-linking into the wall, facilitated by Tyr (Y)-based 176

    motifs such as VXY and YXY (Tierney and Varner, 1987; Lamport et al., 2011). This cross-linking 177

    into the wall can have essential roles in plant development, for example the requirement for 178

    Arabdiopsis AtEXT3 during embryogenesis for cell plate formation (Cannon et al., 2008) and 179

    AtEXT18 in pollen cell wall integrity (Choudhary et al., 2015). Cross-linking of EXTs into the wall is 180

    proposed to result in the cessation of cell expansion, and yet some EXTs are important for cell 181

    elongation in root hairs (Velasquez et al., 2011; Velasquez et al., 2015). The importance of 182

    glycosylation on EXT-like motifs in the moss Physcomitrella patens has recently been shown through 183

    mutants in Hyp O-arabinosyltransferases (HPATs) (MacAlister et al., 2016). Reduced levels of cell 184

    wall-associated Hyp-arabinosides (as found in EXTs) in P. patens resulted in increased elongation of 185

    protonemal tip cells. Therefore, different classes of EXTs have evolved different biological roles and 186

    these functions are governed by both their protein backbone and glycosylation sequences and patterns. 187

    188

    The first EXTs were isolated from tomato (Lamport, 1977) and proteins with blocks of SO3-5 motifs 189

    have been extracted from other angiosperms such as tobacco, carrot, sugar beet and alfalfa (Chen and 190

    and Varner, 1985; Smith et al., 1986; Li et al., 1990) as well as other green plant lineages, including 191

    gymnosperms (Fong et al., 1992), the likely sister group to angiosperms; and chlorophycean alga 192

    (Woessner and Goodenough, 1989), representing the sister to the remaining green plants. EXTs from 193

    these other green plant lineages can have distinct features from embryophyte EXTs, and we use the 194

    term cross-linking (CL)-EXTs to distinguish EXTs with alternating SOn and Y-based motifs, which 195

    predominantly occur in land plants, from the HRGPs that have Y residues in a different domain from 196

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 9

    the SOn-rich domains, as occurs in volvocine algae (see Fig. 1, Johnson et al., companion paper). 197

    Exceptions may exist, for example no ‘classical’ EXTs were detected in the Hyp-poor cell walls of 198

    maize, which also do not have detectable isodityrosine (IDT), an intramolecular cross-link that occurs 199

    between Y-residues in EXTs (Kieliszewski et al., 1990; Kieliszewski and Lamport, 1994). 200

    201

    The PRPs are the most difficult family of HRGPs to define, as very few have been studied to date. 202

    Investigation of PRPs has largely been restricted to legumes and Arabidopsis where they have been 203

    shown to be minimally glycosylated, if at all (Averyhart-Fullard et al., 1988; Datta et al., 1989; Kleis-204

    San Francisco and Tierney, 1990; Lindstrom and Vodkin, 1991; Fowler et al., 1999). The majority of 205

    PRPs are chimeric as they contain an Ole PFAM domain, and they have been implicated in roles in 206

    root development (Bernhardt and Tierney, 2000), stress responses (Li et al., 2016), fiber development 207

    (Xu et al., 2013) and nodule formation (Sherrier et al., 2005). Despite these important functions, 208

    information on the origins and evolution of PRPs remains fundamentally unexplored. 209

    210

    Collectively, the functional characterization of HRGPs points towards their involvement in many 211

    processes within the wall, including signaling and cell wall integrity pathways (Costa et al., 2013; 212

    Pereira et al., 2015; Pereira et al., 2016; Voxeur and Hofte, 2016). As plants evolved multicellularity 213

    and transitioned to land it is likely that greater complexity in signaling networks was required, a role 214

    that could seemingly be fulfilled by HRGPs due to their distinct protein and glycan characteristics. It 215

    is currently unknown if the occurrence of specific HRGPs is associated with evolutionary milestones 216

    and specialization of plant cell walls. Greater insight into the origins, evolution and diversity of 217

    HRGPs will enable researchers to address these key issues. 218

    219

    To study HRGP function throughout plant evolution, we developed a robust method for their detection 220

    that exploits the amino acid bias of the predicted protein backbone and characteristic glycomotifs (see 221

    Johnson et al., companion paper). The motif and amino acid bias (MAAB) pipeline classifies HRGP 222

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 10

    sequences into one of 24 pre-defined descriptive sub-classes (23 HRGP classes and one MAAB class) 223

    and is optimized for recovery of GPI-AGPs, non-GPI-AGPs and cross-linking (CL)-EXTs. Here, we 224

    use this automated bioinformatics pipeline to identify and classify HRGPs within the extensive 1000 225

    plant (1KP) transcriptome data. 226

    227

    The 1KP project (www.onekp.com) has generated large-scale transcriptomic data for over 1000 plant 228

    and algal species, thereby providing deeper sampling of key plant taxa beyond those with sequenced 229

    genomes (Johnson et al., 2012; Matasci et al., 2014; Wickett et al., 2014). Species selected include 230

    mostly non-model land plants, many of which are important for medicine, agriculture, forestry and 231

    biodiversity/conservation, as well as red algae (Rhodophyta), green algae (chlorophytes and algal 232

    streptophytes) and taxa from a small group of freshwater unicellular algae known as glaucophytes - 233

    collectively termed the Archaeplastida (Plantae). Analysis of the 1KP data is already helping resolve 234

    long-standing questions, such as from which streptophyte green algal lineage the progenitors of land 235

    plants likely arose (Wickett et al., 2014). The 1KP transcriptome dataset therefore offers an 236

    unprecedented opportunity to investigate HRGP superfamily evolution and determine if changes 237

    coincide with major evolutionary transitions in the Archaeplastida. 238

    239

    A tandem repeat annotation library (TRAL) is also used to identify repeats in MAAB sequences and 240

    support findings of selective loss and/or gain of CL-EXTs in bryophytes and commelinid monocots. 241

    A combination of BLAST and profile hidden Markov models was used to search for putative 242

    orthologs of pollen-specific AtAGP6/11 in angiosperms (flowering plants) with hits validated by 243

    phylogenetic analysis. These vignettes demonstrate the broad utility of the bioinformatics pipeline on 244

    the 1KP data, providing evolutionary insights into the complexity of the HRGP superfamily. 245

    246

    Results and Discussion 247

    MAAB pipeline for GPI-AGPs and CL-EXTs 248

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 11

    Using the newly developed bioinformatics tool, the motif and amino acid bias (MAAB) pipeline (see 249

    Johnson et al., companion paper), we evaluated the 1KP dataset for HRGPs. We set parameters in 250

    MAAB to best detect the classical, non-chimeric GPI-AGPs, CL-EXTs, PRPs and non-GPI-AGPs 251

    (classes 1-4, respectively) as well as the hybrid HRGPs (classes 5-23) using features of the well 252

    characterized Arabidopsis sequences. MAAB was validated using 15 proteomes from Phytozome and 253

    shown to reliably detect and accurately classify HRGPs into 23 descriptive sub-classes (see Johnson 254

    et al., companion paper). MAAB represents a significant advance of previous methods to detect 255

    HRGPs in being able to identify all classes in one pipeline, without manual curation. 256

    257

    MAAB did not identify any false positives in Phytozome datasets and evaluation of a subset of the 258

    1KP datasets also suggests a very low rate of false positives (see Johnson et al., companion paper). In 259

    Ranunculales, a well-sampled basal eudicot order (39 samples derived from 20 species), we analyzed 260

    143 non-redundant GPI-AGP sequences (five Ranunculaceae and six Papaveraceae datasets; 261

    Supplemental Fig. S1A) and observed 98.6% true positives, with only two sequences not being clear 262

    GPI-AGPs (class 1) and four other sequences flagged as likely contaminants based on their 263

    appearance in multiple unrelated datasets (see legend, Supplemental Fig. S1). The rate of false 264

    positives for CL-EXTs (class 2) was estimated based on analysis of 45 non-redundant bryophyte 265

    sequences (Supplemental Fig. S2) derived from the larger set of 67 identified within the 1KP data 266

    (Fig. 1). Shading of motifs reveals that 42/45 sequences (93.3%) have the expected CL-EXT motifs 267

    throughout the entire backbone. The three remaining sequences could be considered CL-EXTs, or 268

    hybrid CL-EXTs-AGPs or AGPs as they have more AGP motifs than SPn motifs, but both motifs 269

    types are present (Supplemental Fig. S2). In either case, they are definitive HRGPs suggesting the 270

    true positive rate is 100% (45/45), although we cannot rule out that the sequences are partial chimeric 271

    HRGPs until full length sequences become available. 272

    273

    Using the MAAB pipeline to identify and classify HRGPs in transcriptomic datasets spanning 274

    large evolutionary timescales 275

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 12

    Across the entire 1KP data, 45304 HRGPs (class 1 to 23) were identified (Fig. 1) using a multiple k-276

    mer approach (see Johnson et al., companion paper). The majority of sequences detected were in 277

    HRGP classes 1 to 4 (38051) and MAAB class 24 (39933), with a grand total of 85237 sequences 278

    across all 24 MAAB classes. We compared HRGPs identified for each of the 1KP taxonomic groups 279

    (hereafter 1KP groups, which includes clades and grades, see 280

    http://www.onekp.com/samples/list.php), by calculating the percentage of sequences in the classical 281

    (class 1-4) and minor HRGP classes (class 5-23) and the final MAAB class (class 24) out of the total 282

    sequences classified by MAAB (Fig. 2A). The majority of 1KP groups had a similar proportion of 283

    sequences in classes 1-4 (avg. ~50%), 5-23 (avg. ~10%) and 24 (avg. ~40%). However, the algal 284

    groups including those well represented in the 1KP dataset (green algae, comprising chlorophytes and 285

    algal streptophytes, and red algae and Chromista) exhibited a trend towards lower percentages of 286

    HRGP sequences (classes 1-4 and 5-23) and a higher percentage of sequences in MAAB class 24 (Fig. 287

    2A). The relatively low number of GPI-AGPs (class 1) and higher number of class 4 sequences most 288

    likely reflects underlying functional differences in algal AGPs. The high number of class 24 289

    sequences, likely either non-HRGPs or unknown HRGPs, suggests that algae have a more diverse 290

    array of IDPs than land plants. The possibility that HRGP transcripts have not been detected within k-291

    mer assemblies is also possible and could be due to sequence quality and/or sampling issues (lower 292

    than optimal number of datasets and tissue/s, see below). 293

    294

    As a measure of the depth of taxon sampling, the number of taxonomic orders represented and the 295

    number of samples within each 1KP group is reported in Fig. 2B. The green alga group are the best 296

    represented of the algae with 34 orders and 152 samples included in our analyses and hence provide 297

    the most comprehensive dataset for comparison. The average number of sequences for all MAAB 298

    classes (1-24) is summarized by 1KP group (Fig. 2C). As each group represents a variable number of 299

    species, and not all species in a group have the same HRGPs, we have also summarized the detection 300

    rate as a heat map by 1KP group (Fig. 2C) and by taxonomic order (Supplemental Table S2). This 301

    allowed us to assess whether detection of a particular HRGP class is either real or spurious. If ≥50% of 302

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 13

    species in a given taxonomic order have ≥1 sequence for a given HRGP class, we report a >50% 303

    detection rate, and assume this HRGP class is widespread throughout the taxonomic order. The few 304

    class 2 (CL-EXTs) sequences identified in green algae and Chromista are likely sporadic contaminants 305

    based on the low mean number of CL-EXTs (0.1 for green algae, 0.2 for Chromista) (Fig. 2C), low 306

    detection rates (Fig. 2C; Supplemental Table S2) and BLASTp analysis revealing high sequence 307

    similarity to vascular plant CL-EXTs (data not shown). 308

    309

    Here we present a preliminary analysis of the minor HRGP classes (5 to 23), and will then focus on 310

    the major HRGP classes (1 to 4). Classes 5 to 23 include ‘hybrid’ type HRGPs with motifs associated 311

    with two or more classes of HRGP (see Fig 4, Johnson et al., companion paper). For example, class 5 312

    HRGPs contain AGP-bias with CL-EXT motifs (both SP3-5- and Y-based), and are most highly 313

    represented in gymnosperms (conifers, Gnetales, Ginkgo and Cycadales) (Fig. 2C). In general, the 314

    detection of classes 5 to 23 is low and sporadic for most clades with the exception of conifers and core 315

    eudicots. In these clades a high detection rate (>50%) of sequences occurs for most HRGP classes 316

    (Fig. 2C). This suggests that more hybrid/unusual HRGPs occur in these clades, although the low 317

    mean numbers indicate that these sequences are found in a few, rather than all datasets within core 318

    eudicot and conifer orders. Datasets from the 1KP green algae group have a relatively high number of 319

    sequences in class 6 (1.4 ± 4, AGP-bias, EXT glycomotifs (SPn) but not Y-based cross-linking motifs) 320

    (Fig. 2C), which is consistent with our current knowledge of HRGPs isolated from Chlamydomonas 321

    and Volvox (Adair et al., 1983; Woessner and Goodenough, 1989; Woessner et al., 1994; Ferris et al., 322

    2005). The majority of CL-EXTs identified in Arabidopsis are not predicted to be GPI-anchored (see 323

    Table I; Supplemental Table S2; Johnson et al., companion paper) and consistent with this, only a few 324

    GPI-anchored EXTs (class 9) were observed in Phytozome and 1KP data. Only one class, GPI-325

    anchored PRPs (class 14), had no representatives from any 1KP group, which is consistent with our 326

    current, albeit limited, knowledge of PRPs (Averyhart-Fullard et al., 1988; Datta et al., 1989; Fowler 327

    et al., 1999). 328

    329

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 14

    The mean number of GPI-AGPs and CL-EXTs in the 1KP data was lower than expected, particularly 330

    in the core eudicots/rosids (~4 and 7 respectively, Fig. 2C), compared to the numbers in the majority 331

    of eudicot and commelinid genomic datasets (see Table I, Johnson et al., companion paper). We 332

    suspected that this might be due to the restricted number of tissue types used for sequencing and so we 333

    investigated a subset of 1KP datasets in the Ranunculales (39 samples derived from 20 species). For 334

    each dataset, we manually removed likely redundant sequences (see Methods), reducing the mean 335

    number of GPI-AGP sequences from 7.3 to 4.8 (Supplemental Fig. S1A) and CL-EXTs from 5.1 to 336

    2.9 (Supplemental Fig. S1B). Most of the redundant sequences contained variably sized and 337

    positioned deletions (data not shown). The majority of the 1KP angiosperm datasets are from leaf 338

    tissue, yet this tissue has the lowest average number of GPI-AGPs (4.0), with 4.4 in roots, 5.0 in 339

    flowers, and 6.0 in developing fruits (Supplemental Fig. S1A). Leaf tissue datasets also had a low 340

    number of CL-EXTs (2.6) compared to roots (3.8) and fruits (3.3) (Supplemental Fig. S1B). When 341

    all unique sequences are tallied across the four Papaver spp, 12 GPI-AGPs (Supplemental Fig. S1C) 342

    and six CL-EXTs were identified (Supplemental Fig. S1D). Thus, the total numbers of GPI-AGPs 343

    and CL-EXTs observed in most 1KP datasets are under-estimates due to limited tissue sampling, 344

    despite the likely sequence redundancy that was not routinely removed by MAAB pipeline processing. 345

    346

    Significantly, the greater species sampling within 1KP mostly corroborated our Phytozome 347

    observations and allow us to draw some major conclusions: (1) AGPs are widespread, both GPI-AGPs 348

    and non-GPI-AGPs being convincingly found in the green algae group as well as Chromista and 349

    Glaucophyta; (2) CL-EXTs occur in most groups of extant land plants including bryophytes (non-350

    vascular plants); and (3) non-chimeric PRPs (class 3) are uncommon and largely confined to eudicots 351

    (Fig. 2C). 352

    353

    GPI-AGPs evolved early in the green plant lineage 354

    Our data show that GPI-AGPs are present in all algal groups sampled in the 1KP project with the 355

    exception of the Rhodophyta. Absence of GPI-AGPs in red algae is consistent with several other 356

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 15

    independent studies. A detailed analysis of the completed genome of two unicellular red algae, 357

    Cyanidioschyzon merolae (Hashimoto et al., 2009; Ulvskov et al., 2013) and Galdieria sulphuraria 358

    (Ulvskov et al., 2013) revealed that they lack the key enzymes for the synthesis of GPI-anchors, and 359

    therefore GPI-AGPs would not be expected to be present in these algae. To our knowledge there are 360

    no reports in the literature of red algal AGPs based on either AGP glycan epitopes or β-Glc Yariv 361

    staining (Supplemental Table S1). AGPs are not well known in the Chromista, which include brown 362

    algae, although there are reports of low levels of Hyp (

  • 16

    patens (Lee et al., 2005) and some vascular plants (Nguema-Ona et al., 2012). β-Glc Yariv staining in 384

    eight liverwort species revealed ubiquitous tissue staining, with darker staining in apices of some 385

    species, suggesting that apical targeting of AGPs may be a feature that arose in bryophytes (hornworts, 386

    mosses and liverworts) (Basile and Basile, 1987). AGPs have also been implicated in cell plate 387

    formation in liverworts (Shibaya and Sugawara, 2009). Since apical targeting as well as AGP delivery 388

    to phragmoplasts (a scaffold for cell plate assembly) exists in streptophyte algae (Charales, 389

    Coleochaetales and Zygnematales), it is possible that these two functions of AGPs evolved earlier than 390

    the emergence of bryophytes (Leliaert et al., 2012; Bowman, 2013). 391

    392

    It should be noted, however, that β-Glc Yariv staining and glycan epitope labelling can provide only 393

    general information about presence/absence of AGP-specific glycans, as these tools are not able to 394

    distinguish the different sub-classes of AGPs. The information revealed by MAAB in the 1KP data 395

    detailing the suite of AGPs present in bryophytes in addition to the increasing use of the bryophyte 396

    models P. patens (Kofuji and Hasebe, 2014) and Marchantia polymorpha (Ishizaki et al., 2016) will 397

    facilitate molecular approaches to investigate the function of specific AGPs through, for example, 398

    mutant studies. 399

    400

    Origin and selective loss of CL-EXTs 401

    Previous studies of HRGPs show that EXT-like Hyp-arabinosides are evolutionarily ancient and are 402

    found in chlorophyte algae, such as Chlorella vulgaris (Lamport and Miller, 1971) and C. reinhardtii 403

    (Ferris et al., 2001; Bollig et al., 2007). Our data suggest that CL-EXTs arose in bryophytes, as no 404

    CL-EXTs were detected in the two volvocine (chlorophyte green algal) genomes (see Table I, Johnson 405

    et al., companion paper) and only contaminating sequences were detected in algal 1KP datasets (Fig. 406

    2). EXT epitopes have been identified in some chlorophyte algal species (Domozych et al., 2009; 407

    Sørensen et al., 2011) (Supplemental Table S1) but this likely reflects the presence of chimeric 408

    and/or hybrid EXTs rather than CL-EXTs. Similar to the Phytozome analysis, hybrid HRGPs with 409

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 17

    both AGP and EXT glycomotifs were found in the 1KP green algae group (e.g. RYJX_Locus_857 410

    (Pandorina morum) and VALZ_Locus_7817 (C. noctigama), both class 19). Where Y residues are 411

    present, these are found in a separate domain to the P-rich domain (see Fig. 1, Johnson et al., 412

    companion paper). This result suggests that CL-EXTs arose though the juxtaposition of pre-existing 413

    SPn EXT glycomotifs with refined Y-based motifs (e.g. YXY and VXY), rather than via the 414

    simultaneous evolution of both protein motifs. 415

    416

    To gain a better understanding of the occurrence of CL-EXTs in bryophytes, the phylogenetic 417

    distribution of CL-EXTs present in the 1KP data was assessed in more detail (Fig. 3). CL-EXTs were 418

    detected by the MAAB pipeline in all five hornwort species representing two taxonomic orders 419

    (Supplemental Fig. S2) whereas they were identified in only 5 of 16 orders of mosses (Hypnales, 420

    Pottiales, Diphysciales, Buxbaumiales, and Sphagnales) and 2 of 8 orders of liverworts (Marchantiales 421

    and Sphaerocarpales) that were sampled by 1KP (Supplemental Table S2). This relatively low 422

    detection rate in mosses and liverworts is unlikely to be due to low species representation as it is 423

    comparable to the sampling depth in hornworts (Fig. 3). The data therefore suggest that while CL-424

    EXTs are widespread in hornworts, they have a more limited distribution within the moss and 425

    liverwort lineages. If hornworts are the sister group of other land plants (Wickett et al., 2014), then 426

    this may imply that CL-EXTs were lost from some lineages of liverworts and mosses. An alternative 427

    explanation is that CL-EXT genes are present in moss and liverworts but are expressed at very low 428

    levels and were therefore not detected. 429

    430

    To exclude the possibility that MAAB missed partial CL-EXT sequences due to a strict criterion for an 431

    ER signal sequence (see Johnson et al., companion paper), a tandem repeat annotation library (TRAL) 432

    was generated to identify CL-EXT repeat motifs (Schaper et al., 2015). The TRAL library identified 433

    all repeats in class 2 sequences that were longer than 10 amino acids, present in ≥4 copies and 434

    significantly different from random sequence evolution (p-value < 0.05; see Methods). A library for 435

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 18

    MAAB class 24 sequences was also generated as a control. Each repeat was used to construct a 436

    circular sequence profile hidden Markov model (cphmm) and these models were used to search the 437

    MAAB input data (see Data file 2, Johnson et al., companion paper) to identify all sequences that 438

    contain these repeats. A simple tool for browsing the results of TRAL is available at 439

    http://services.plantcell.unimelb.edu.au/hrgp/index.html and this tool was used to search all the 440

    bryophyte species for TRAL hits. 441

    442

    There is a good correlation between the number of species with CL-EXTs from MAAB and those with 443

    SPn-/Y-based repeats identified by TRAL (Fig. 3; Data file 6). Most importantly, there was no 444

    increase in the number of bryophyte species containing TRAL repeats (in the MAAB input data; see 445

    Data file 2, Johnson et al., companion paper) compared to MAAB class 2 sequences, suggesting that 446

    only a small number of partial CL-EXTs are excluded by the MAAB pipeline. 447

    448

    TRAL repeats from class 24 were also searched as these could identify both novel HRGP motifs and 449

    other repeats of interest that may have been excluded by the MAAB-imposed filters. The majority of 450

    bryophyte species had hits to TRAL repeats generated from class 24 sequences; however, none 451

    contained both SPn and Y repeats and only a few have repeats containing Y, suggesting that class 24 452

    contains few, if any CL-EXTs. Combined, TRAL and MAAB outputs indicate that there are a number 453

    of moss and liverwort species that lack CL-EXTs. This leads us to suggest that there has been either 454

    selective loss (e.g. Jungermanniales and most species from Hypnales) or multiple independent gains 455

    (e.g. Sphagnales and Marchantiales) of CL-EXTs in mosses and liverworts. Confirming this result 456

    using genome sequencing of selected species and investigation of the morphology and development of 457

    mosses and liverworts with and without CL-EXTs would be an interesting avenue for future research. 458

    These non-vascular plants may also help uncover the role of CL-EXTs with different sequence motifs 459

    and repeat lengths. 460

    461

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 19

    CL-EXTs have been lost in many grass species 462

    Evaluation of Phytozome data shows CL-EXTs containing both SPn≥3 and Y-based motifs are found in 463

    almost all eudicot species, but are surprisingly absent from Eucalyptus grandis (eudicot/rosid), 464

    Aquilegia coerulea (basal eudicot), and the four species of grasses (monocot/commelinid; Poaceae). 465

    No CL-EXTs were found in either P. patens or the two volvocine algal species (see Table I, Johnson 466

    et al., companion paper). The low number of HRGPs in some of these species is suspected to be due 467

    to annotation and assembly issues, based on our detection of CL-EXT-like open reading frames in 468

    these genomes using 1KP-identified CL-EXTs as query sequences (see Methods). In contrast, this 469

    approach was not successful for genomes from the commelinid monocots ((APG IV, 2016); 470

    designated as ‘monocots/commelinids’ in the 1KP group nomenclature). This absence is convincing 471

    because there are 17 different grass species (21 datasets) in the 1KP data and one species, Eleusine 472

    coracana, is represented by four different datasets (leaves, flowers, un/stressed roots) (see Data file 3, 473

    Johnson et al., companion paper). As an additional check we employed BUSCO (Benchmarking 474

    Universal Single-Copy Orthologs), an analysis tool that quantitatively assesses transcriptome 475

    completeness based on the presence of near-universal single-copy orthologous genes selected from 476

    OrthoDB (Simao et al., 2015). Between 65-92% (excluding three outliers) of 956 single-copy genes 477

    were identified in the monocots/commelinids datasets (Fig. 4B). The value for the Poaceae was at the 478

    high end of this range (median 88.4%, mean 83.8%), suggesting that the grass transcriptome 479

    assemblies are high quality and that CL-EXTs should have been detected if they were present. Our 480

    findings are also supported by a recent bioinformatics study of Z. mays and O. sativa genomes, where 481

    no CL-EXTs were identified (Liu et al., 2016). 482

    483

    Primary cell walls of commelinid monocots have a composition distinct from eudicots such as 484

    Arabidopsis, and gymnosperms (Bacic et al., 1988; Carpita and Gibeaut, 1993; Doblin et al., 2010). 485

    We therefore investigated whether an absence of CL-EXTs was also observed in other commelinid 486

    families represented by 1KP. As for Poaceae, CL-EXTs were not detected in the sedge family 487

    Cyperaceae (total of three species), and three other families within Poales, although they were 488

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 20

    identified in Bromeliaceae and Restionaceae (1 species each) (Fig. 4A). CL-EXTs were also not 489

    detected in two families within the Zingiberales (Zingiberaceae and Marantaceae). The commelinid 490

    monocots are a major clade within the larger monocot group (APG IV, 2016). As CL-EXTs are 491

    absent from at least four grass genomes, including the Sanger-sequenced rice genome (see Table I, 492

    Johnson et al., companion paper), and occur widely in the 1KP monocot group (which excludes the 493

    commelinid monocots) (Fig. 2; Supplemental Table S2; Data file 6), we suggest that there has been 494

    selective loss of CL-EXTs within the commelinid monocots. Further studies are required to confirm 495

    their absence from specific commelinid species. 496

    497

    It is likely that the cross-linking function of CL-EXTs would need to be replaced by something else in 498

    commelinid monocots to provide the necessary scaffold for cell wall development. Obvious 499

    candidates are monomeric phenolic acids such as the hydroxycinnamic acids (ferulic and p-coumaric) 500

    of commelinid monocot walls, absent in the non-commelinid monocot, eudicots (with the exception of 501

    the Caryophyllales) and gymnosperm walls, which form covalent cross-links between arabinoxylan 502

    chains (reviewed in Bacic et al., 1988; Harris, 2005). Additional experimentation is needed to test this 503

    hypothesis. 504

    505

    Orthologs of GPI-AGPs and CL-EXTs in Brassicales 506

    The broad sampling of species in the 1KP data provides an opportunity to investigate how far 507

    orthologous relationships of selected sequences can be traced in plant lineages (Wickett et al., 2014). 508

    Given IDPs are usually poorly conserved throughout evolution we wanted to test if putative HRGP 509

    orthologs could be detected. As Arabidopsis HRGP sequences are the most well characterized, we 510

    undertook phylogenetic analysis of Arabidopsis and GPI-AGPs and CL-EXTs identified by MAAB in 511

    the Brassicales (see Methods) (Fig. 5). The maximum likelihood (ML) tree of the GPI-AGPs shows 512

    that sequences from Brassicaceae datasets group most closely with the Arabidopsis sequences, 513

    generally with good (>70%) bootstrap values. Sequences from other Brassicales families cluster 514

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 21

    further away according to their evolutionary relatedness and with lower bootstrap support (Fig. 5A). 515

    Although showing low bootstrap support, AtAGP59, the additional GPI-AGP identified by MAAB 516

    analysis, clusters with AGP6 and AGP11 (Fig. 5A). 517

    518

    The phylogenetic analysis of Brassicales CL-EXTs revealed a similar hierarchical pattern to the GPI-519

    AGPs, with Brassicaceae sequences generally clustering nearer to the Arabidopsis sequences, but with 520

    variable bootstrap support (Fig. 5B). Some CL-EXT 1KP sequences did not group with an 521

    Arabidopsis sequence, making prediction of orthologous relationships difficult, even within the 522

    Brassicaceae. This is perhaps not surprising since tandem repeat proteins often evolve more rapidly 523

    than other proteins (Moesa et al., 2012; van der Lee et al., 2014). The clustering pattern of GPI-AGPs 524

    in the Brassicales tree (Fig. 5A) largely reflects that observed for the Arabidopsis sequences (see Fig. 525

    2C, Johnson et al., companion paper). This suggests that despite the low bootstrap values it may be 526

    possible to identify orthologous sequences using phylogenetic analysis over larger evolutionary 527

    distances, such as within angiosperms (flowering plants). 528

    529

    Detection of orthologs in angiosperms: AtAGP6/11 as an example 530

    One pair of Arabidopsis GPI-AGPs, AtAGP6 and 11 (see sub-clade AGP-h, Fig. 2C, Johnson et al., 531

    companion paper), have been well studied as they are involved in pollen development and pollen tube 532

    growth and therefore impact reproductive capacity (Levitin et al., 2008; Coimbra et al., 2009; Coimbra 533

    et al., 2010). Orthologs of these genes could therefore be expected to be found in a broader subset of 534

    eudicots and potentially early angiosperms, depending on gene conservation. Although a recent 535

    attempt to find putative orthologs of AtAGP6/11 in Quercus suber (cork oak; core eudicots/rosids, 536

    Fagales) pollen EST data was unsuccessful (Costa et al., 2015), we predicted that it should be possible 537

    to identify orthologs in the 1KP data, particularly within datasets that include floral tissue. A 538

    combination of BLAST (Phytozome and NCBI) and profile hidden Markov (HMMER) models (see 539

    Methods) were used to search for putative AtAGP6/11 orthologs in the 1KP multiple k-mer data 540

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 22

    (MAAB input and output, see Data files 2, 3, 4, Johnson et al., companion paper). In some cases, 541

    additional sequences were identified by BLAST from other sources (for example, the Amborella 542

    trichopoda genome) for use as full-length evolutionary markers (see Methods). The HMMER models 543

    were optimized by varying the input sequences and the length of the aligned region (see Methods). 544

    The best results were obtained using full-length AtAGP6/11-type proteins (including the N-terminal 545

    ER and C-terminal GPI-anchor signal sequences) from the Brassicaceae family (model 1; 546

    Supplemental Table S3). This was measured by a higher ‘true’ positive and lower false positive rate, 547

    as assessed by manual curation (Supplemental Table S4; see Methods). 548

    549

    The resulting sequences were subjected to phylogenetic analysis and the ML tree displayed in Fig. 6 550

    shows the relatedness of a large sub-set of AtAGP6/11-like sequences derived from 1KP data and 551

    sequenced genomes. Almost all the eudicot sequences group together with AtAGP6/11. The monocot 552

    and monocot commelinid sequences form a separate sub-clade with AtAGP58 and 59 (Fig. 6), albeit 553

    with low bootstrap support. Notably, no AtAGP6/11-like sequences were detected within 1KP 554

    samples from Fagales, the order that includes Q. suber, including those containing floral tissue 555

    (Supplemental Table S4) despite two similar sequences being identified in the Q. robor genome 556

    (Qurob_scaffold_362 and 554). Only two sequences identified by HMMER model 1 were apparent 557

    false positives (OHAE_12357 and 4810; Polygala lutea, Fabales), positioned in a sub-clade with 558

    Arabidopsis AtAGP1/2/4/5 and OsAGP1 from rice (Fig. 6). The relatively clear separation of putative 559

    AtAGP6/11 orthologs from the other GPI-AGPs suggests that phylogenetic analysis may provide a 560

    reasonable prediction of gene orthology for these IDPs. In this case, potential AtAGP6/11 orthologs 561

    could be confidently detected in both eudicot and monocot lineages, suggesting the existence of an 562

    ancestral gene in a common ancestor prior to their divergence ~140-150 Myr ago (Chaw et al., 2004). 563

    However, no potential AtAGP6/11 ortholog was identified in Amborella trichopoda, which is the 564

    sister group of all other angiosperms. Unfortunately, all eight 1KP datasets from ‘basal’ angiosperms 565

    (members of Amborellaceae, Nymphaeales, Austrobaileyales) were derived from either leaf or shoot 566

    tissue, not floral tissue, and A. trichopoda is the only basal angiosperm genome that is publically 567

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 23

    available. Our demonstrated success at finding more HRGPs using multiple k-mer assemblies (see 568

    Johnson et al., companion paper) suggests that using this approach to analyze existing and future 569

    basal-most angiosperm floral transcriptome datasets would be advantageous before concluding an 570

    absence of AtAGP6/11 orthologs amongst this group. 571

    572

    To explore the robustness of the AtAGP6/11 sub-clade further, all GPI-AGPs from the 1KP datasets 573

    reported in Fig. 6 were used to produce a larger angiosperm GPI-AGP tree (Supplemental Fig. S3). 574

    As expected the bootstrap values were low, however, most of the additional 290 GPI-AGP sequences 575

    grouped into the sub-clades observed in Fig. 2C of our companion paper (see Johnson et al.). These 576

    reflect putative Arabidopsis orthologs for AtAGP17/18, AtAGP9, AtAGP4/7/10, AtAGP2/3, 577

    AtAGP58, AtAGP25/26/27, AtAGP1 and AtAGP6/11/59. Importantly, despite the low bootstrap 578

    support, the majority of 1KP sequences in the AtAGP6/11 sub-clade in Fig. 6 also group with 579

    AtAGP6/11 in the angiosperm GPI-AGP tree (Supplemental Fig. S3), suggesting that these 580

    sequences are indeed the most similar and therefore likely to be AtAGP6/11 orthologs within the 581

    analyzed transcriptomes. 582

    583

    Investigation of the GPI-AGP sequence alignment and further analysis of each individual GPI-AGP 584

    sub-clade revealed differences in the presence and distribution of specific amino acids (e.g. K, M, Q, 585

    and D/E) among sub-clade members (Fig. 7, Supplemental Fig. S4). For example, putative orthologs 586

    of AtAGP6/11 from both eudicots and some monocots share scattered K residues, concentrated in the 587

    first half of the protein followed by a region containing several acidic residues (Fig. 7; Supplemental 588

    Fig. S4K). 589

    590

    The scattered K residues are noticeably absent from the putative orthologs from Q. robur 591

    (Supplemental Fig. S4A,K). Further analysis is therefore required to confirm these sequences, and 592

    other putative AtAG6/11 genes in order to determine the ‘true’ AtAGP6/11 orthologs. With additional 593

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 24

    taxonomic and appropriate tissue sampling, it should be possible to fill the evolutionary gaps and trace 594

    the origin of AtAGP6/11 genes, as well as other HRGPs. Complementation of angiosperm mutant 595

    lines, for example, the agp6 agp11 double mutant of Arabidopsis that displays pollen phenotypes 596

    (Levitin et al., 2008; Coimbra et al., 2009; Coimbra et al., 2010), with progenitor AtAGP6/11 genes 597

    will be essential to test their functional orthology and may reveal information about these motifs, and 598

    possibly the specific glycosylation required for HRGP function in particular tissue types. 599

    600

    Conclusions 601

    HRGPs: remaining challenges and future perspectives 602

    Despite being minor components of the wall, the HRGPs make a significant contribution to wall 603

    properties and to cellular identity (Knox et al., 1991). With access to the 1KP data we have made 604

    significant progress towards understanding the evolutionary timeframe of when this large gene family 605

    of diverse and complex proteins likely evolved. We have provided a platform to guide experimental 606

    approaches to confirm our data and answer questions as to how HRGPs evolved, as well as 607

    investigating the losses of major classes of HRGPs through evolution (e.g. loss of CL-EXTs in 608

    mosses, liverworts and commelinid monocots). 609

    610

    Tracking individual HRGPs throughout evolution is a major challenge. Our investigation of 611

    AtAGP6/11 orthologues shows that it is possible to trace HRGPs with specific characteristics and 612

    determine their evolutionary origins. However, the CL-EXTs in particular, as a consequence of their 613

    variable tandem repeat motifs and diverse sequence length, cannot be aligned in a meaningful way 614

    with the tools developed for folded proteins. Alternate approaches are needed, such as investigating 615

    length variation and indels rather than pairwise alignment. An approach that could be useful is to 616

    perform global multiple sequence alignments using a graph-based method that takes into account the 617

    observation that indels can occur anywhere in a repeat (Szalkowski and Anisimova, 2013). 618

    619

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 25

    The biggest challenge will be to understand the individual and coordinated roles of HRGPs within the 620

    context of either a single cell or tissue type. Our data provide a valuable resource to initiate system 621

    wide analyses to identify spatio-termporal expression changes of HRGPs throughout development. 622

    Even in the less-complex non-vascular plants, GPI-AGPs and CL-EXTs are multigene families, and 623

    understanding their individual function will provide exciting challenges for researchers for many years 624

    to come. Defining the role of individual AGPs and EXTs is confounded by their interactions with 625

    other cell wall polymers; cross-linking of AGPs (Tan et al., 2013) and EXTs to other polymers such as 626

    pectins and heteroxylans has been experimentally verified (reviewed in Velasquez et al., 2012)). 627

    Understanding how widespread cross-linking of HRGPs to other cell wall polymers will be 628

    particularly challenging. The emerging availability of tractable genetic systems for non-model land 629

    plants will allow researchers to use the information provided by MAAB to explore these questions in a 630

    range of species. 631

    632

    Glycosylation is a characteristic feature of HRGPs and defines the interactive molecular surface. 633

    Predicting the post-translational modification of HRGPs including the sites of P-hydroxylation and 634

    types of glycosylation will require detailed structural analysis of many members of each multigene 635

    family, from key transitions throughout the plant lineage. Even within a single species, glycosylation 636

    is tissue-dependent, and directed by the context of amino acids surrounding the glycomotif (Tan et al., 637

    2003; Shimizu et al., 2005; Kurotani and Sakurai, 2015). For many HRGPs, glycosylation often 638

    completely encases and hence masks the protein backbone, as is the case for GPI-AGPs, AG-peptides 639

    and CL-EXTs (see Fig. 1, Johnson et al., companion paper). Furthermore, it also drives the three 640

    dimensional conformation of the protein backbone, and therefore determines the presentation of 641

    epitopes on the molecules surface. Given that plant HRGPs have the same structural complexity as 642

    animal mucins, it is expected that the core protein sequence will influence the extent and type of 643

    glycosylation, which can also vary in a tissue-/cell-specific manner (Gerken et al., 2013). It is known 644

    that HRGP glycosylation differs between volvocine algae and embryophytes, but information on 645

    streptophyte green algae is lacking. Introducing the same HRGP backbone into multiple species and 646

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 26

    tissue types throughout evolution would be an interesting approach to investigate context-dependent 647

    glycosylation patterns and/or function. A complementary investigation of the glycoysltransferases 648

    (GTs) that act to glycosylate HRGPs would also greatly assist our understanding of this complex 649

    process. An exciting future of HRGP research beckons to determine the biological and functional 650

    significance of this fascinating superfamily of secreted glycoproteins. 651

    652

    653

    Methods 654

    Phylogenetic Analysis 655

    Phylogenetic analyses were performed using MEGA 6.06 (Tamura et al., 2013). Sequences 656

    (untrimmed) were first aligned using MUSCLE (Edgar, 2004) using the default settings, except for 657

    Supplemental Fig. S3, gap = -2 (default gap = -2.9). The best model was identified (find best protein 658

    model (ML) option) and used for each alignment to generate a maximum likelihood tree (with 100 659

    bootstrap replications), using all sites in the alignment. Sequences used for phylogenetic analysis 660

    (fasta format), and the best model, are in Supplemental Fig. S5. 661

    662

    Datasets 663

    A summary of datasets and methods is also available at 664

    http://services.plantcell.unimelb.edu.au/hrgp/index.html. 665

    Data analysis was based on 1282 samples downloaded from the official 1KP mirror 666

    (onekp.westgrid.ca; as of March 2014) (see Johnson et al., companion paper). Five additional datasets 667

    were analyzed (July 2015) to improve coverage of hornworts (UCRN-Megaceros_tosanus, FAJB-668

    Paraphymatoceros_hallii, RXRQ-Phaeoceros_carolinianus-sporophyte, WCZB-669

    Phaeoceros_carolinianus-gametophyte) and to include the single sample from clade Ginkgoales 670

    (SGTW-Ginkgo biloba). These additional five datasets are included in the MAAB input and output 671

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 27

    files (see Data file 2, 3, 4, Johnson et al., companion paper) and most of the analysis with these 672

    samples is presented, with the exception of Supplemental Table S2. 673

    674

    MAAB summary analysis 675

    The sequences and associated annotations from MAAB classification, including the assembly in which 676

    they were identified, are provided in Data files 3, 4 (of Johnson et al., companion paper). Data from 677

    individual samples (hereafter referred to as datasets) that were identified by the 1KP consortium as 678

    containing contaminants were removed from analyses (unless otherwise noted) and can be found in 679

    Data file 5 (of Johnson et al., companion paper). The number of sequences in each MAAB class for 680

    each individual 1KP sample is provided in Data file 6. Mean number of HRGPs by MAAB class (for 681

    each 1KP group (Fig. 2C) or commelinid monocot familiy (Fig. 4A and C)) was calculated by 682

    averaging the appropriate data in Data file 6 (that reports the number of sequences per HRGP class 683

    (columns) for each 1KP dataset (rows)). Total numbers of HRGPs by 1KP group (Fig. 1) were also 684

    generated from Data file 6. Graphs (Fig. 4A and C) were generated using R/Bioconductor 685

    (Gentleman et al., 2004). DNA sequences of GPI-AGPs (class 1) and CL-EXTs (class 2) are provided 686

    in Data files 7, 8 respectively. 687

    688

    Reducing redundancy in 1KP datasets 689

    For phylogenetic analyses, redundant sequences were eliminated from datasets. The most probable 690

    sequence was identified based on existing knowledge, as follows: MAAB 1KP output (see Data file 3, 691

    4, Johnson et al., companion paper) was sorted by 1KP locus. Sequences for each dataset (1KP locus) 692

    were then sorted by HRGP class and the sequences from the desired HRGP class pasted into a new 693

    Microsoft Excel sheet. Sequences were sorted alphabetically by sequence (to group sequences with 694

    similar N-termini) and converted to tfasta format. Multiple sequence alignments were performed with 695

    MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle/) to identify redundant sequences. Where two or 696

    more sequences had large regions of identity, the longer sequence was retained unless there were clear 697

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 28

    artefacts at either the N- or C-terminus. Retained sequences were subjected to further rounds of 698

    alignments to ensure only unique sequences remained. When redundancy was unclear, both sequences 699

    were retained in small datasets but only one for larger datasets. 700

    701

    Removing redundancies from the 1KP sequence data was necessary because like other short-read 702

    sequence data, it offers a wealth of information, but redundancy can occur from multiple sources 703

    including splice variants, sequencing errors, recent duplications, chimeric transcripts and polyploidy 704

    (Gruenheit et al., 2012; Yang and Smith, 2013; Xie et al., 2014). Sequence redundancies were obvious 705

    in our analysis (Supplemental Fig. S1A,B) and two relatively common occurrences were read-706

    through of GC-rich hairpins in the cDNA during reverse transcription and poor-quality sequence at 707

    one end of the paired-read. cDNA synthesis for all 1KP datasets was performed at 42°C (Chen Li, 708

    BGI-Shenzhen, personal communication). 709

    710

    Tandem repeat annotation library (TRAL) construction 711

    TRAL v0.3.5 (Schaper et al., 2015) was used to identify and evaluate tandem repeats present in 1KP 712

    sequences from HRGP class 2 (CL-EXTs), class 3 (PRPs) and class 24 (

  • 29

    clustering using USEARCH (Edgar, 2010) and aligned with MUSCLE v3.81 (Edgar, 2004) for each 724

    model. A tool for browsing and searching TRAL results is available at 725

    http://services.plantcell.unimelb.edu.au/hrgp/index.html, and was used to determine the number of 726

    species with class 2 and class 24 TRAL repeats, using keyword searches by order and/or species (Fig. 727

    3). 728

    729

    BUSCO (Benchmarking Universal Single-Copy Orthologs) analysis 730

    1KP transcriptomes were assessed for completeness using the BUSCO analysis tool (Simao et al., 731

    2015). The BUSCO plant dataset and software were downloaded from http://buscos.ezlab.org/. 732

    733

    Profile hidden Markov (HMMER) models for finding putative AtAGP6/11 orthologs 734

    Protein sequences were aligned with MUSCLE v3.81 (Edgar, 2004) using default parameter settings. 735

    Sequence alignments were then used as input into HMMER v3.1b1 hmmbuild (http://hmmer.org/) to 736

    construct a profile HMM. Subsequently, hmmsearch was used (with default thresholds) to search 737

    against MAAB 1KP input sequences (see Data file 2, Johnson et al., companion paper). Seven 738

    different models were evaluated (Supplemental Table S3) and the results of the best model (see Data 739

    file 4 Johnson et al., companion paper) are summarized in Supplemental Table S4. For each positive 740

    1KP dataset a single ‘best’ sequence was selected manually based on presence of N-terminal ER 741

    signal, full or partial C-terminal GPI-signal sequences and then checked by iterative multiple sequence 742

    alignment and phylogenetic analysis. To distinguish between true class 4 non-GPI-AGPs and 743

    potential partial GPI-AGPs, proteins needed to have the following GPI-anchor signal characteristics: a 744

    potential omega (ω), ω+1 and ω+2 sites and a basic residue and/or a hydrophobic domain) (Eisenhaber 745

    et al., 2003), unless otherwise noted (e.g. ZHMB_16956, XMVD_14518), as summarized in 746

    Supplemental Table S4. 747

    748

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 30

    Additional AtAGP6/11-like sequences were obtained from other sources (outlined below) for use as 749

    full-length controls and to increase the sequence representation in various 1KP groups. TBLASTn 750

    was used to search Phytozome and/or the nucleotide collection (NR/NT at NCBI) using word size 2, 751

    no filtering, and restricting organisms to same species, genera or family as appropriate, with either 752

    AtAGP6/11 or OsAGP7/10 as query sequences. If unsuccessful, a suitable 1KP sequence was selected 753

    (coding sequence only, Data file 7) and BLASTn performed (word size 7 and no filtering). When 754

    positive hits were identified, these were used as query sequences to find additional sequences 755

    (maximum of two per species used). Q. robur AtAGP6/11 putative orthologs were identified in 756

    TBLASTn searches of Q. robur assembly v1 scaffolds available at https://urgi.versailles.inra.fr/blast/. 757

    For the basal angiosperm Amborella trichopoda (not present in Phytozome v9) GPI-AGPs were 758

    identified amongst annotated proteins accessible at ftp://ftp.ensemblgenomes.org/pub/plants/release-759

    31/fasta/amborella_trichopoda/ using 50% PAST as outlined in Schultz et al. (2002). A similar 760

    approach was used to find evidence for genomic sequences of CL-EXTs from selected genomes using 761

    1KP-identified CL-EXTs as query sequences (Data file 8). For example, 1KP sequence 762

    AYMT_Locus_34756 identified partial CL-EXT sequence E. grandis scaffold_6: 763

    36791844..36792375, and VGHH_Locus_4377 identified A. coerulea scaffold_34: 764

    1845754..1846286). 765

    766

    Data access & Datasets 767

    Additional datasets for this paper. Numbering continues from Johnson et al., companion paper, and a 768

    full list of data files is provided there. 769

    Data file 6. Mean number of HRGPs in each HRGP class (columns) for each 1KP dataset (rows), 770 calculated from MAAB output files (Data files 3, 4). Filename: SA003_MAAB-hits-summary-by-771

    class-and-sample.xls. 772

    Data file 7. DNA sequences for 1KP GPI-AGPs (class 1) detected by MAAB. Locus ID (Data files 773 3, 4) of class 1 GPI-AGPs was used to extract the DNA sequence from the appropriate multiple k-mer 774

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 31

    assembly. Where more than one locus ID is reported for a single protein (e.g. with different k-mers), 775

    only one DNA sequence is reported. Filename: 1kp_agp_incl_dnas_9-12-2015.xls. 776

    Data file 8. DNA sequences for 1KP CL-EXTs (class 2) detected by MAAB. Locus ID (Data files 3, 777

    4) of class 2 CL-EXTs was used to extract the DNA sequence from the appropriate multiple k-mer 778 assembly. Where more than one locus ID is reported for a single protein (e.g. with different k-mers), 779

    only one DNA sequence is reported. Filename: class2_incl_dnas_9-12-2015.xls. 780

    Data file 9. Sequences identified by HMMER model 1 (putative AtAGP6/11 orthologs). Filename: 781

    agp6_model1_hmm_hits_20150922.xls. 782

    783

    Supplemental Data 784 785

    Supplemental Table S1. Summary of selected AGP and extensin epitopes described in the literature. 786

    787

    Supplemental Table S2. HRGP detection rate for each 1KP taxonomic order. 788

    789

    Supplemental Table S3. Summary of HMMER models used to identify putative orthologs of 790

    AtAGP6 and AtAGP11. 791

    792

    Supplemental Table S4. Summary of HMMER model performance by 1KP sample and analysis of 793

    sequences selected for phylogenetic analysis. 794

    795

    Supplemental Fig. S1. Number of GPI-AGPs and CL-EXTs by tissue in Ranunculaceae and 796

    Papavaceae samples represented in 1KP. 797

    798

    Supplemental Fig. S2. Bryophyte CL-EXTs sequences. 799

    800

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 32

    Supplemental Fig. S3. Phylogenetic analysis (ML-tree) of GPI-AGPs (all sub-clades) from selected 801

    angiosperms. 802

    803

    Supplemental Fig. S4. Conserved features for putative orthologs of selected Arabidopsis GPI-AGPs. 804

    Supplemental Fig. S5. Sequences used for phylogenetic analysis (in fasta format). 805

    806

    Acknowledgements 807

    The authors thank Mark Chase from the Kew Royal Botanic Gardens, Markus Ruhsam and Philip 808

    Thomas at the Royal Botanic Garden Edinburgh, and Steven Joya from the University of British 809

    Columbia for supplying plant samples to 1KP. We also thank Drs Birgit and Frank Eisenhaber, 810

    Bioinformatics Institute A*Star Singapore for providing us with a standalone version of big-PI plant 811

    predictor. 812

    813

    Figure Legends 814

    Figure 1. Number of HRGP sequences identified by MAAB in each HRGP class by 1KP group. 815

    The number of datasets per clade is shown in brackets after the 1KP group name. All MAAB classes 816

    are reported for multiple k-mers (multk: 39,49,59,69) (unshaded columns). No sequences were 817

    detected for HRGP class 14. Selected data is reported for assemblies with k-mer = 25 (k25, shaded 818

    column). Red text indicates that the sequences detected within this class and 1KP group are likely to 819

    be contaminants (see Results/Discussion). 820

    821

    Figure 2. Summary of HRGP sequences identified using the MAAB pipeline. A, Percentage of 822

    total HRGP sequences (by 1KP group) that are found in the classical HRGP classes (1 to 4), hybrid 823

    HRGP classes (5 to 23) and non-HRGPs (class 24). B, Number of orders analyzed and number of 824

    samples per order for each 1KP group. The monocots group excludes the commelinid monocots. C, 825

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 33

    Overview of the mean number of HRGPs for each MAAB class in the 1KP dataset (by 1KP group). 826

    Shading of boxes represents the detection rate, indicated as the % of orders with hits (see 827

    Supplemental Table S1). A shaded box showing detection of HRGPs with a value of zero indicates 828

    an average number of sequences between 0-0.1. Red text indicates that the sequences detected within 829

    this HRGP class and 1KP group are likely to be contaminants (see Results/Discussion). 830

    831

    Figure 3. Distribution and number of CL-EXTs (class 2) in the bryophytes. MAAB detected CL-832 EXTs in all hornwort species in the 1KP data whereas the distribution in mosses and liverworts shows 833

    either loss of, or multiple independent gains of, CL-EXTs. Tandem Repeat Annotation Libraries 834

    (TRAL) of repeats identified in class 2 and class 24 (control) sequences were used to search the 835

    MAAB input sequences and identify repeats present in the bryophyte lineages. There is good 836

    correlation between the species with CL-EXTs identified by MAAB and those with SPn/Y-based 837

    repeats identified by TRAL using class 2 repeats. No CL-EXTs were identified using TRAL repeats 838

    from class 24 sequences. The phylogenetic tree is based on Cole and Hilger (2013) and was selected 839

    because it includes all bryophyte orders. 840

    841

    Figure 4. Mean number of MAAB class 2 CL-EXT sequences (A) identified using the MAAB 842 pipeline within each family of monocots/commelinids. Many families lack CL-EXTs but this is not 843

    due to poor quality transcriptomes as BUSCO analysis (B) shows the percentage of 956 single-copy 844

    orthologous genes found in the monocots/commelinids families is high (avg. ~80%) Family data (k-845

    mer 39 only) are summarized in a box-and-whisker plot after removal of outliers (2 Poaceae sp.: 59.4, 846

    65.7; 1 Aracaeae sp.: 18.1). The mean number of class 1 GPI-AGPs (C) was evaluated and these are 847

    present in all families that lack CL-EXTs. Only Cannaceae and Restionaceae have both GPI-AGPs 848

    and CL-EXTs, although for most families the number of datasets is low (1 to 3). The total number of 849

    datasets per family is indicated in parentheses. 850

    851

    Figure 5. ML tree of Brassicales 1KP and Arabidopsis GPI-AGPs (A), and CL-EXTs (B). (A,B) 852 ML trees generally show strong support for sub-clades with Arabidopsis sequences and 1KP 853

    sequences from family Brassicaceae. Putative GPI-AGP sub-clades (A) are separated by a horizontal 854

    dotted line and the Arabidopsis ortholog(s) are indicated by boxes (right of tree). Sequences from the 855

    Brassicaceae family are in green with Arabidopsis sequences in larger, bold font. AtEXT3 was 856

    included with CL-EXTs (B) even though it was classified as class 20 because of its shared bias (

  • 34

    orange, 40-59 black). 1KP datasets (1KP locus ID and family name) are shown on the Brassicales tree 860

    (inset) (Stevens (2001), Version 12, July 2012 http://www.mobot.org/mobot/research/apweb/). Scale 861

    bars for branch length measures the number of substitutions per site. Sequences: GPI-AGPs 862

    Brassicales (Supplemental Fig. S5A) and CL-EXTs Brassicales (Supplemental Fig. S5B). 863

    864

    Figure 6. Phylogenetic analysis of class 1 GPI-AGPs to identify AtAGP6/11 orthologs. ML tree 865

    (MEGA) was constructed using putative AtAGP6/11 orthologs identified predominantly from 1KP 866

    transcriptomes and HMMER model 1 (see Supplemental Table S3), with a few sequences from other 867 sources to increase the breadth of sampling (see Methods; Supplemental Table S4). The tree also 868

    includes at least one sequence from each Arabidopsis GPI-AGP sub-clade, representative rice GPI-869

    AGPs identified by MAAB (see Fig. 2, Johnson et al., companion paper) and four GPI-AGPs from 870

    Amborella trichopoda (Amtri_ERM96654, Amtri_ERM95342, Amtri_ERN01113 and 871

    Amtri_ERN06202). Sequence names are coloured by 1KP group: basal angiosperms (grey), non-872

    commelinid monocots (purple), commelinid monocots (pink), basal eudicots (aqua), core eudicots 873

    (green), asterids (red) and rosids (orange). Numbers on the nodes represent support with 100 bootstrap 874

    replicates (≥70 green, 60-69 orange, 40-59 black). 1KP sequences are identified by Group_Order_1KP 875

    identifier_sequence locus. Genomic sequences and other sequences from NCBI follow a similar 876

    format, but include a five letter abbreviation of genus and species (see Methods). Symbols next to 877 sequence names are used to indicate the source and MAAB class of sequences and other relevant 878

    information. An asterisk (*) after the sequence name indicates additional information is provided in 879

    Supplemental Table S4, for example, the YNXR (Poales) dataset is contaminated with Asparagales, 880 however, the ML tree suggests these sequences are indeed Poales. Scale bar for branch length 881

    measures the number of substitutions per site. 882

    883

    Figure 7. Schematic representation of the GPI-AGP sub-clades showing the presence and 884

    distribution of specific amino acids in the PAST-rich protein backbone. Order of GPI-AGPs is 885 based on clades in angiosperm GPI-AGP tree (Supplemental Fig. S3); figure part numbers (A to K) 886 are based on alignments (Supplemental Fig. S4) and the AtAGP5/10 (G) subclade is included for 887 comparison (based on AGP-c subclade (see Fig. 2C, Johnson et al., companion paper)). A, Putative 888

    orthologs of Lys-rich AtAGP17/18 have scattered lysine (K) residues throughout the sequence and 889

    contain a short K-rich domain near the C-terminus. The glycomotifs are predominantly XP1-2. B, The 890

    AGP9 sub-clade also has a short Lys-rich region with ≥3 K residues and scattered K residues, 891

    however, the glycomotifs are distinct from AtAGP17/18, being predominantly [S/T]P3. AGP4/7/10 892

    (C), AGP2/3 (D), AGP5/10 (G), AGP5 (H) and AGP1(I) sub-clades have a ‘classical’ PAST-rich 893

    backbone with no obvious bias towards other amino acids. E, Most of the putative orthologs of 894

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 35

    AtAGP58 are relatively methionine (M)-rich, with scattered M residues between AGP glycomotifs 895

    and contain a Q residue at the presumed N-terminus. This Q residue at the presumed mature N-896

    terminus is also found in putative orthologs of Lys-rich AtAGP17/18 (A), AtAGP9 (B), 897

    AtAGP1/2/3/4/7/5/10 (C,D,G,H,I), and AtAGP58 (E), but not AtAGP25/26/27 (F), or AGP6/11/59 898

    (J/K). J/K, Putative orthologs of AtAGP6/11/59 share scattered K residues, concentrated in the first 899

    half of the protein and also a small cluster of acidic residues, either aspartic acid (D) or glutamic acid 900

    (E). 901

    902

    903

    904

    905

    Literature Cited 906

    Adair WS, Hwang C, Goodenough UW (1983) Identification and visualization of the sexual agglutinin 907 from the mating-type plus flagellar membrane of Chlamydomonas. Cell 33: 183-193 908

    APG IV (2016) An update of the Angiosperm Phylogeny Group classification for the orders and 909 families of flowering plants: APG IV. Botanical Journal of the Linnean Society 181: 1-20 910

    Averyhart-Fullard V, Datta K, Marcus A (1988) A hydroxyproline-rich protein in the soybean cell 911 wall. Proc. Natl. Acad. Sci. USA 85: 1082-1085 912

    Bacic A, Harris PJ, Stone BA (1988) Structure and function of plant cell walls. In J Priess, ed, The 913 Biochemistry of Plants, Vol 14. Academic Press, New York, pp 297-371 914

    Basile DV, Basile MR (1987) The Occurrence of Cell Wall-Associated Arabinogalactan Proteins in the 915 Hepaticae. Bryologist 90: 401-404 916

    Bernhardt C, Tierney ML (2000) Expression of AtPRP3, a proline-rich structural cell wall protein from 917 Arabidopsis, is regulated by cell-type-specific developmental pathways involved in root hair 918 formation. Plant Physiology 122: 705-714 919

    Biegert A, Soding J (2008) De novo identification of highly diverged protein repeats by probabilistic 920 consistency. Bioinformatics 24: 807-814 921

    Bollig K, Lamshoft M, Schweimer K, Marner FJ, Budzikiewicz H, Waffenschmidt S (2007) Structural 922 analysis of linear hydroxyproline-bound O-glycans of Chlamydomonas reinhardtii--923 conservation of the inner core in Chlamydomonas and land plants. Carbohydr Res 342: 2557-924 2566 925

    Borner GHH, Sherrier DJ, Weimar T, Michaelson LV, Hawkins ND, MacAskill A, Napier JA, Beale 926 MH, Lilley KS, Dupree P (2005) Analysis of detergent-resistant membranes in Arabidopsis. 927 Evidence for plasma membrane lipids rafts. Plant Physiology 137: 104-116 928

    Bowman JL (2013) Walkabout on the long branches of plant evolution. Current Opinion in Plant 929 Biology 16: 70-77 930

    Cannon MC, Terneus K, Hall Q, Tan L, Wang Y, Wegenhart BL, Chen L, Lamport DTA, Chen Y, 931 Kieliszewski MJ (2008) Self-assembly of the plant cell wall requires an extensin scaffold. 932 Proceedings of the National Academy of Sciences, USA 105: 2226-2231 933

    Carpita NC, Gibeaut DM (1993) Structural models of primary cell walls in flowering plants: 934 consistency of molecular structure with the physical properties of the walls during growth. 935 Plant Journal 3: 1-30 936

    www.plantphysiol.orgon March 22, 2020 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

    http://www.plantphysiol.org

  • 36

    Chaw SM, Chang CC, Chen HL, Li WH (2004) Dating the monocot-dicot divergence and the origin of 937 core eudicots using whole chloroplast genomes. Journal of Molecular Evolution 58: 424-441 938

    Chen J, and Varner JE (1985) Isolation and characterization of cDNA clones for carrot extensin and a 939 proline-rich 33-kDa protein. Proc Natl Acad Sci USA 82: 4399-4403 940

    Choudhary P, Saha P, Ray T, Tang Y, Yang D, Cannon MC (2015) EXTENSIN18 is required for full male 941 fertility as well as normal vegetative growth in Arabidopsis. Front Plant Sci 6: 553 942

    Coimbra S, Costa M, Jones B, Men