Plant genome and transcriptome annotations: from ... · Popular transcrip-tome assembly tools such as TRINITY [8] require significant op-timization to produce an assembly of reasonable

Plant genome and transcriptome annotations

from misconceptions to simple solutionsMarie E Bolger Borjana Arsova and Bjorn UsadelCorresponding author Bjorn Usadel RWTH Aachen BioSC Biology I Forschungszentrum Julich Wilhelm Johnen Str Julich 52428 Germany Tel thorn49 (0)241 8026767 E-mail usadelbio1rwth-aachende

Abstract

Next-generation sequencing has triggered an explosion of available genomic and transcriptomic resources in the plantsciences Although genome and transcriptome sequencing has become orders of magnitudes cheaper and more efficientoften the functional annotation process is lagging behind This might be hampered by the lack of a comprehensive enumer-ation of simple-to-use tools available to the plant researcher In this comprehensive review we present (i) typical ontologiesto be used in the plant sciences (ii) useful databases and resources used for functional annotation (iii) what to expect froman annotated plant genome (iv) an automated annotation pipeline and (v) a recipe and reference chart outlining typicalsteps used to annotate plant genomestranscriptomes using publicly available resources

Key words plant genome annotation plant ontologies plant gene family databases genome annotation pipelines

Introduction

Next-generation sequencing has triggered an explosion of avail-able genomic and transcriptomic resources in the plant sciences[1] Since the genome sequence of the model plant Arabidopsisthaliana was published in 2000 [2] around 180 plant genome se-quences have been published (httpwwwplabipddeportalsequence-timeline httpsenwikipediaorgwikiList_of_sequenced_plant_genomes) This number is greatly enhanced by includ-ing plant transcriptome assemblies As of August 2016 the tran-scriptome shotgun assembly database of the National Centerfor Biotechnology Information (NCBI) lists over 450 plant assem-blies (httpswwwncbinlmnihgovTraceswgsviewfrac14TSA)whereas the plant 1KP project alone (onekpcom) includesgt1300 plant transcriptomes This is further complemented bycountless plant transcriptomes found in SupplementalMaterials This remarkable surge is a testament to the genomicsrevolution that has provided us with the tools to quickly se-quence whole transcriptomes on a relatively modest budget

which typically can yield sufficient data for a working qualitytranscriptomic inventory (lsquothe transcriptomersquo)

Generating an assembly for a species is merely the first stepin the elucidation of the genome Extensive processing and ana-lysis is necessary before the resource will yield scientific in-sights In the case of genome assemblies a process calledstructural annotation is necessary This process detects genesincluding their exonintron structures within a given assemblyAlthough this can rely on extensive lsquoextrinsic evidencersquo in theform of RNA sequence (RNASeq) [3] this is often complementedby sophisticated statistical models of gene structures to findexonintron structures in what is termed ab initio discoveryThis process has been covered in detail and readers are referredto [4] Current popular tools to structurally annotate a genomeinclude the automated pipelines MAKER-P specifically de-veloped for plants [5] and the generalist BRAKER1 [6]

Assembling RNASeq data to produce high-quality transcrip-tome assemblies as a shortcut to a lsquofunctional genomersquo [7] is stillnot a trivial task despite these data sets being typically smaller

Marie Bolger is a postdoctoral fellow at the Forschungszentrum Julich (Julich Germany) working on the German Plant Primary DatabaseBorjana Arsova is an FNRS Researcher at the Department of Life Sciences University of Liege (Liege Belgium) and currently a guest scientist at IBG-2Plant Sciences at the Forschungszentrum Julich (Julich Germany) She uses proteomics for the study of nutrient uptake in plantsBjorn Usadel is a professor at the RWTH Aachen University (Aachen Germany) and Director at the IBG-2 Plant Sciences at the Forschungszentrum Julich(Julich Germany) where he leads the bioinformatics group He is also core group leader at the BioEconomy Science Center BioSCde where he is mostlyinvestigating plant gene function and cell wall decompositionSubmitted 22 August 2016 Received (in revised form) 29 November 2016

VC The Author 2017 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40)which permits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

Briefings in Bioinformatics 2017 1ndash13

doi 101093bibbbw135Paper

and consisting uniquely of gene rich data Popular transcrip-tome assembly tools such as TRINITY [8] require significant op-timization to produce an assembly of reasonable quality Forrecipes and cookbooks in the plant field one can refer to [9ndash11]

Once the gene structures have been detected the necessarynext step is to ascribe biological function to the genes in a pro-cess known as functional annotation Surprisingly performingthis task to a degree of accuracy remains challenging despitethe extensive accrual of knowledge about gene function inmodel and crop species Indeed there is still a large percentageof genes many of which can be found across multiple specieswhose function has not been ascertained

Within the plant community A thaliana remains the bestannotated plant largely because of the tremendous effort of TheArabidopsis Information Resource (TAIR) which integratescommunity-based curations together with annotations from lit-erature evidence Over 2800 experimentally supported annota-tions have been included within the past 2 years alone [12] Thiswealth of data has been adopted and further augmented byAraPort [13] an open-source resource which encourages thecommunity to contribute not only data modules but visualiza-tion tools and apps Despite these extensive resourcespublished data [14] indicate that still only about 77 of the pro-tein-coding sequences could be assigned any kind of structuredannotation This figure is in agreement with data from thePLAZA database an online platform that has processed the an-notations from several plant species into a uniform format [15]

Controlled terms and vocabularies for plantfunctional annotations

Homology-based functional annotation is the transfer of exist-ing knowledge about a gene sequence to another gene sequencewithin the same species or to another species This process es-sentially depends on the existing knowledge about a gene func-tion being transferable to genes of a similar sequence andassumes that this similarity reflects functional homologyAlthough an experimentalist working with a non-model speciesmay likely be content with an annotation such as lsquoquite similarto an Arabidopsis thaliana malic enzymersquo this annotation bearsseveral difficulties for a sustainable annotation frameworkThis also hampers structured analysis of genome-wide data toanswer questions like lsquohow many genes are involved in photo-synthesis or glycolysisrsquo This problem can largely be alleviatedby using controlled vocabularies and functional ontologies [16]to provide a consistent description of gene products across dif-ferent species

The Gene Ontology ontology

The most widely used functional annotation is lsquoGene Ontologyrsquo(GO) that provides defined lsquoGO termsrsquo to enable gene productsto be described by three separate domains lsquoBiological Processrsquowhich describes the gene in terms of a recognized series ofevents or molecular functions lsquoCellular Componentrsquo describingthe location of a protein (or rather biomolecule) at a cellularandor macromolecular level and lsquoMolecular Functionrsquo describ-ing the jobs or abilities that a gene product has on the molecularlevel Besides GO terms each GO annotation contains an lsquoevi-dence codersquo which provides information on how a GO term wasapplied to a gene Evidence codes indicate whether the annota-tion is based on experimental evidence computational analysisauthor statements or curatorial statements all of which aremanually curated GO annotations also contain evidence codes

which are used to indicate assignment by automaticcomputa-tional methods This has the advantage that annotations basedon experimental data can be treated with much higherconfidence than automatic annotations of related proteinsIn addition by qualifying where such an annotation came fromit is easier to check the respective annotation In this respectcurated GO resources such as the one for A thaliana representan invaluable resource

The GO ontology is structured as a directed acyclical graphmaking it possible to infer more general terms from a specificterm This additionally allows grouping data eg our malicenzyme might be annotated with the GO term lsquoGO0009763rsquolsquoNAD-malic enzyme C4 photosynthesisrsquo from which one couldimmediately deduce using eg the Amigo browser [17] that theterms lsquoGO0015979rsquo lsquophotosynthesisrsquo and lsquoGO0015977rsquo lsquocarbonfixationrsquo also apply

The Kyoto Encyclopedia of Genes and Genomes ontology

Another widely used resource is the Kyoto Encyclopedia ofGenes and Genomes (KEGG httpwwwkeggjp) This featuresa number of databases that aim to link genomic- and molecu-lar-level information to higher-level functions of the cell organ-ism and the ecosystem Annotation with KEGG is based onassociating molecular function with orthologous groups whichare defined based on clustering of genes from completed gen-omes (currently gt4000 genomes) using the KEGGrsquos internallsquoKEGG Orthology and Links Annotationrsquo (KOALA) program Theresulting information is stored in the lsquoKEGG Orthologyrsquo (KO)database and assignment of KO entry identifier (also called Knumbers) provides the gene annotation KEGG aims to includereference to primary literature for each KO entry (76 of around19 000 KO entries contains this as of September 2015) [18]

CYC and other metabolic resources

In the case of enzymatic reactions there is also the CYC net-work whose plant section is under the Plant Metabolic network(httpwwwplantcycorg) umbrella [19] This is mainly used todescribe enzymatic functions and while it enables one to buildreaction networks [20] it does not cover additional functionalterms The plant Reactome is a database of plant metabolic andregulatory pathways which have been curated for the referencespecies rice and applied to 58 other plant species [21] Finallythe Enzyme Commission numbers [22] (httpwwwchemqmulacukiubmbenzyme) describe reactions and classify enzymeswhich are also referenced by KEGG CYC and ReactomeAlthough CYC uses citations to the primary literature exten-sively the enzymes within CYC are not formally linked to anno-tations via evidence codes

The MapMan BIN ontology

One additional ontology resource that is specifically plantfocused is the MapMan BIN ontology This was originallydevised to visualize omics data on plant pathways [23] but hasgrown since and currently comprises around 2000 ontologicalterms The MapMan ontology is modeled in a hierarchical treestructure with higher-level categories based on biological pro-cess and leaf categories containing detailed function The struc-ture was manually defined by experts in the respective fieldsand changes are applied periodically based on primary litera-ture Although MapMan endeavors to assign lsquoevidencersquo to theBINs (httpmapmangabipdorgwebguestmapcave) these arecurrently updated as new releases are published As this

2 | Bolger et al

ontology is strictly plant-specific it lacks non-plant terms fea-tured by eg GO KO and the CYC databases

Common misconceptions

Functional annotation usually depends on the transfer of func-tional knowledge from one gene to another This assumes thatthe initial functional annotation is not only correct but also of alsquorobust naturersquo to allow transfer There are many pitfalls whichcan occur during this transfer process and which may ultim-ately lead to either incomplete or missannotation of genes

The lsquoannotatablersquo gene space in plants

The number of annotated genes in an assembly is a frequentlyused assessment in published data [24] Before one can assessthe results of this one needs to first know how many genes canbe annotated This question is far from easy to answer as itvaries not only between species but also varies depending onwhat is considered as a lsquohigh confidencersquo annotation The use ofthe ontologies mentioned previously highlights that the func-tion of many genes remain lsquodark matterrsquo The data shown inFigure 1 give an upper bound based on the best annotated Athaliana genome When one considers annotations pertaining toa molecular function or biological process separately slightlygt50 of the A thaliana genes can be assigned a GO function(Figure 1A) Even when asking whether a gene has a molecular

function or a biological process annotated to it in our test datathe number reached 64 (Figure 1B) The numbers are naturallylower when one only considers experimental evidence data andnot electronic annotations which are often based on homologytransfer Only in the case of lsquocellular componentrsquo are thesenumbers much higher (Figure 1A) as the subcellular localiza-tion can usually be predicted easily as shown in the sectionlsquoSubcellular localizationrsquo below

Put in other words this means that obtaining functional an-notations (based on Molecular function and Biological processFigure 1B) for more than two-thirds of the plant protein-codinggenes analyzed is relatively unlikely and a number much lowerthan this could suggest an incomplete genome ortranscriptome

Annotation quality can vary

Even in cases when genes have been successfully annotatedthe question about the quality of the annotations needs to beaddressed One simple pitfall is to take sequence similarity toannotated proteins at face value Indeed any functional anno-tation derived by simple sequence similarity transfer should bescrutinized carefully before embarking on a particular hypoth-esis about this particular protein Given that proteins generallyconsist of one or more distinct domains embedded in genericregions annotations that only look at sequence similarity but

Figure 1 Overview of the number of annotated genes for the genome of the model plant A thaliana based on analysis of GO terms

The GoSlim annotations were downloaded from the TAIR Web site (ftpftparabidopsisorgOntologiesGene_OntologyATH_GO_GOSLIMtxtmdashdownloaded July 2016)

For each of the three main GO domains the respective annotations were categorized according to the evidence code The lsquoExperimentalrsquo category includes genes anno-

tated with evidence codes IDA (inferred from direct assay) IMP (inferred from mutant phenotype) IGI (inferred from genetic interaction) IPI (inferred from physical

interaction) or IEP (inferred from expression profile) lsquoCuratedrsquo includes those which had evidence codes IC (Inferred by Curator) NAS (Non-traceable Author

Statement) and TAS (Traceable Author Statement) but lacking any annotation covered by the lsquoExperimentalrsquo category lsquoElectronicrsquo includes genes annotated with evi-

dence codes ISS (Inferred from Sequence or Structural Similarity) ISO (Inferred from Sequence Orthology) ISM (Inferred from Sequence Model) IBA (Inferred from

Biological Aspect of Ancestor) RCA (Inferred from Reviewed Computational Analysis) or IEA (Inferred from Electronic Annotation) but lacking any annotation from the

lsquoExperimentalrsquo or lsquoCuratedrsquo categories (A) The three aspects are shown separately (B) The best annotation from multiple domains is shown with the combination of

Molecular Function and Biological Process on the left and all three domains combined on the right

Plant genome and transcriptome annotations | 3

do not take into account that certain domains are necessary toexert a function might lead to an incorrect annotation

Absence of annotation does not mean absenceof function

Furthermore absence of a specific annotated gene in a plantgenometranscriptome does not necessarily mean that theplant cannot perform a particular function Functional annota-tion is highly dependent on complete gene models so in casesof partial or incomplete gene models as is frequently seen withtranscriptome assemblies the tools used might not be sensitiveenough to ascribe (the correct) function on a partially assembledgene Thus caution needs to be exercised when posing hypoth-eses based on gene or even pathway loss Such scenarios needto be carefully validated using manual approaches A first stepwould be to analyze the genometranscriptome specifically forthis function by using eg BLAST [25] or searching for a neces-sary domain using eg HMMER3 [26] using the resources listedin Table 1 In the case of no good candidates more sophisticatedand even experimental methods would need to be used to dem-onstrate the absence of a gene function beyond reasonabledoubt

In conclusion one should keep in mind that functional an-notations should be treated with care and taken as workinghypotheses that might or might not need to be verified by biolo-gical experimentation

Functional annotations using generic tools andad hoc pipelines

Given the current levels of plant genome annotation it is per-haps unsurprising that frequently the sole annotation processused is based on sequence similarity to the well-annotatedplant A thaliana Indeed often a simple BLAST search is per-formed using the genometranscriptome as a query and the A

thaliana proteome as a subject This is because of the well-maintained and annotated A thaliana genome In addition to Athaliana a selection of plant protein reference files can be ob-tained from Phytozome [27] andor Ensembl Plants [28] withmanually curated data sets for all species available fromUniProtKBSwiss-Prot [29]

Many functional annotation tools require that the input dataare protein sequences and some tools which can accept eithernucleotide or protein sequence show superior results whenprotein sequences are submitted Thus extracting high-qualityprotein sequences is often the first step in functionalannotation The genome structural annotation pipeline fromAUGUSTUSBRAKER1 [3] provides auxiliary scripts (httpaugustusgobicsdebinariesscripts) which will conveniently out-put the protein sequences after genome annotation into aFASTA file

Finding coding regions in transcriptome assemblies

De novo transcriptome assemblies however pose additionalchallenges as coding sequences need to be identified andframeshift mutations corrected before protein conversionESTScan [30 31] a program which can detect coding sequencesin DNA has been developed to perform this task but needs to betrained with examples before it is used on a specific data setThis program exploits bias in nucleotide usage found in codingsequences relative to noncoding sequences Other heuristicssuch as identifying the longest open reading frame (ORF) or bysearching for frames that code for functional domains usingTransDecoder (httpsgithubcomTransDecoder) [32] presentalternative approaches FrameDP [33] and GenemarkS-T [34]perform a similar function but use sophisticated methodswhich remove the need for the training steps FrameDP was de-veloped to discover coding sequences in transcripts or tran-script fragments such as ESTs and is part of the TRAPID [35]integrated tool (discussed further below) GenmarkS-T providesan algorithm which is somewhat robust against assemblyerrors and has been shown to compare favorably with otherexisting tools Despite showing superior performance whentested by the authors against Transdecoder and ESTScan theauthors noted problems arising when RNASeq-based assem-blies gave rise to the transcript models This is because theunderlying transcript models contained multiple errors leadingto concomitant problems in coding region finding [34]

Sequencing errors carried over from the assembly to theannotation process might create artificial amino acid mutationsor insert stop codons in ORFs shortening existing or creatingnon-expressed peptides Proteomics experiments are vital in ex-perimentally validating gene models originating from

Table 1 Available resources for protein family- or domain-based functional identifications

Resource Version Families Web address Comments

PFAM 300 16 306 httppfamxfamorgTIGRFAM 150 4488 httpwwwjcviorgcgi-bin

tigrfamsindexcgiPANTHER 110 13 096 httppantherdborgSMART 71 1312 httpsmartembl-heidelbergde License necessaryEggNOG 45 190 648 (37 127 plants) httpeggnogdbembldeapphomeINTERPROSCAN 580 gt40 000 integrated

entrieshttpswwwebiacuk

interprosearchsequence-searchMeta engine including all other

resources except EggNOG but notnecessarily the most recentversion at all times

CDD 315 52 411 (11 474from CDD curation)

httpwwwncbinlmnihgovcdd Uses RPS-BLAST and includespartly older versions of PFAMSMART and TIGRFAM

4 | Bolger et al

transcriptome assemblies by comparing the expressedmeas-ured peptides with the in silico database as described in [36 37]However functional annotation can frequently deal with an in-accurate ORF as long as most of the true coding region is re-tained This is because similarities can still be identified basedon slightly truncated regions

Annotation based on profile hidden Markov models

Tools that specialize in identifying domains within a sequencehave advantages over simple similarity comparisons as domain

sequences typically are highly conserved between genesDomains are frequently represented as profile hidden Markovmodels (HMMs) which are deduced from multiple sequencealignments stemming from several species thus capturing typ-ical sequence diversity at individual residues This provides amore sensitive way to approach the sequence annotation prob-lem Table 1 provides a list of the main tools which use proteinfamily models often in the form of profile HMMs PFAM [38] islikely the best known resource in this area and can currentlyidentify gt16 000 families TIGRFAM [39] is a manually curatedresource which provides HMMs for full-length proteins andshorter regions PANTHERrsquos [40] distinguishing feature is that itsplits families into subfamilies allowing for a fine-grained anno-tation SMART focuses on regulatory domains which are oftenmore difficult to tackle [41] Finally the EggNOG database [42]provides access to precomputed orthologous groups includingplant-specific ones along with functional annotations

Even though these resources do not necessarily attribute aspecific function to a protein they do provide valuable evidenceor hints toward the function of the protein In addition to thesestandalone resources HMMs used by many of these tools canbe downloaded (in some cases after having applied for a li-cense) and used with the HMMER software suite [26]

The integrated InterProScan resource

Many of the protein family databases mentioned in the previoussection contain overlapping information (eg the NAD-bindingdomain of a malic enzyme would be identified both by thePFAM HMM PF03949 and the SMART HMM SM00919) Thus it isoften beneficial to use InterProScan [43] as this platform bringssuch lsquoredundantrsquo information from the different protein fami-lies under one common umbrella (for the malic enzyme NAD-binding domain regardless of whether it was identified viaSMART PFAM or both it would assign the InterPro IdentifierIPR012302 lsquoMalic enzyme NAD-bindingrsquo) InterProScan addition-ally can assign GO terms by mapping from InterPro identifiersto GO term(s) using a cross-mapper called Interpro2GO [44]Even though InterProScan does not always support the latestversion of all the databases a single tool that offers a diverserange of databases is of great benefit to users A notable non-HMM-based reference database offered by InterProScan is theConserved Domain Database (CDD) [45] which like PFAM com-prises protein domains but also features full length proteinalignments CDD relies on RPS-BLAST and thus ultimately onposition-specific scoring matrices [46] to identify sequencesimilarity From a userrsquos perspective it is interesting to notethat CDD also incorporates data from PFAM TIGRFAM andSMART offering another tool that incorporates several sourcessuch as InterProScan CDD offers the advantage of a simplersetup scenario than InterProScan as it is based on RPS-BLAST

Using genome-scale orthology finding

To increase or improve functional annotations genome-scaledraft-quality orthology detection is frequently incorporatedThis also helps in exploring protein family relationships andcomparative genomic approaches In the simplest case thiscould be a reciprocal best BLAST hit which offers a quick andeasy way to obtain a one-to-one relationship table Tools suchas Inparanoid [47] Orthofinder [48] and OrthoMCL [49] useBLAST and clustering algorithms in a convenient pipeline Eachoffers different benefits and the performance of several toolshas recently been compared by Altenhoff et al [50] However itshould be noted that incomplete transcriptomesgenomes canlead to misdetection of orthologs as the proper ortholog mightbe missing in the incomplete transcriptomegenome Also es-pecially for reconstructed transcriptomes it is not possible togenerate full-length sequences for all contigs This leads to add-itional decreases in ortholog detection accuracy which need tobe accounted for

In the case of closely related species one can refine orthol-ogy prediction further if full-genome information is available bymaking use of synteny ie that gene order remains conservedacross species [51] The online tool CoGe [52] offers an auto-mated pipeline to perform this task However this is a special-ized step that lies downstream of typical functional annotation

Adding information

In addition to gene function (captured by lsquomolecular functionrsquoor lsquobiological processrsquo in the GO ontology) it can be useful togain an insight into the topological considerations for plant pro-teins as well as their subcellular localization and potential post-transcriptional modification

Transmembrane domains

One approach for adding protein topology is predicting trans-membrane domains based on the protein sequence TMHMM[53] offers a simple Web-based solution for alpha helix detectionand can also be downloaded as a standalone tool for academicuse The free tool TOPCONS [54] which is actively being de-veloped combines a selection of prediction tools to provide aconsensus result This has demonstrated better performancebut its local installation is slightly more complex than TMHMMbecause of software dependencies A comprehensive listing oftransmembrane domain prediction tools is available in theAramemnon transmembrane database [55] and in a recent re-view [56] (Table 2)

Subcellular localization

To predict subcellular localization and thus the third GO do-main lsquocellular componentrsquo the general tool TargetP [57] or thesecretory signal peptides predictor SignalP [58] are frequentlyused These however tend to perform poorly in the case ofplants [59 60] so other plant-specific tools such as Plant-mPLoc[61] AtSubP [60] or for N-terminal targeting sequences the toolPredotar [62] may produce superior results (Table 2)

However finding an adequate performance evaluation isoften difficult To avoid biased results one needs to validate thepredictions on a data set which was not used for training of thepredictors [63] and it might be advisable to rely on several toolsas is done in the curated reference database for A thaliana pro-tein localization SUBA3 [64]

Posttranslation modifications

In the case where one is interested in signaling one can predictphosphorylation sites using four plant tools at the momentnamely PHOSFER [65] PhosPhAt [66] PlantPhos [67] and Musite[68] In terms of performance the latter three tools have re-cently been compared and it seemed that for serinethreoninepredictions at least in the model A thaliana Musite performedbest It was however noted that for tyrosine phosphorylationthe sensitivity can be lower for Musite at certain specificityranges [69] (Table 2)

Predicting function based on expression behavior

Finally one might venture into functional prediction usingnonsequence-based data A prime example is the lsquoguilt by as-sociationrsquo approach whereby one assumes that a gene to beannotated might exert function X (lsquoguiltrsquo) if it is co-expressed(lsquoassociatedrsquo) with one or several genes of the same knownfunction X [70ndash73] The underlying idea is that if several genesconsistently show the same expression there is a good chance

that they are co-regulated as they are needed for the sameprocess or pathway Insightful examples are macromolecularcomplexes such as ribosomes or the cellulose synthesis com-plex where this guilt-by-association approach works well [74]Although this approach usually requires many transcriptomicdata sets tissue-specific data sets are often available in gen-ome andor transcriptome projects which might prove to besufficient Indeed tissue-specific data might even be helpfulto unravel tissue-specific processes as has been done for Athaliana seed coat mucilage [75 76] In the case where meta-bolic data are available this might be used to complement theguilt-by-association approach using protocols described re-cently [77 78]

Caution needs to be taken as the guilt-by-association prin-ciple is not always reliable and must be evaluated critically Theapproach depends on the number of reliably annotated geneswithin a network Indeed even though it works well in caseswhere queries are restricted to cases similar to the ones listedabove (few genes in a well-annotated network) the usefulnessof the method decreases when the procedure is scaled up [79]

Table 2 Available resources to complement functional annotation

Resource Web address Comments

TMHMM httpwwwcbsdtudkservicesTMHMM Can be downloaded and installed locally for academics Onlineversion allows the submission of 10 000 sequences at most

TOPCONS httptopconsnet Can be downloaded and installed freely (GPL v2) Online versionallows the submission of 100 MB sequence data at most

TargetP httpwwwcbsdtudkservicesTargetP Can be downloaded and installed locally for academics The onlineversion allows the submission of 2000 sequences at most

Plant-mPLoc httpwwwcsbiosjtueducnbioinfplant-multi

At time of writing problem with multifasta submission

AtSubP httpbioinfo3nobleorgAtSubP Up to 2000 predictionsPredotar httpsurgiversaillesinrafrpredotarpredo

tarhtmlOnly N-terminal signals for mitochondria and chloroplasts

PHOSFER httpsaphireusaskcasaphirephosferindexhtml

Free for academic use only

PhosPhAt httpphosphatuni-hohenheimdephosphathtml

PlantPhos httpcsbcseyzuedutwPlantPhosPredicthtml

Uploads lt2 MB

Musite httpmusitenet 100 predictions can be downloaded and installed locally freely(GPL v3)

TAIRProteinInteraction Data

httpswwwarabidopsisorgdownloadindex-autojspdirfrac142Fdownload_files2FProteins2FProtein_interaction_data

Arabidopsis PredictedInteractome andArabidopsis interactionsViewer

ftpftparabidopsisorghometairProteinsProtein_interaction_dataInteractome20or httpbarutorontocainteractionscgi-binarabidopsis_interactions_viewercgi

Downloadable from TAIR these are the data for interactome v20(also available at the Arabidopsis Interactions viewer) In total70 000 predicted interactions and 3000 experimentally deter-mined interactions

IntAct httpwwwebiacukintact Interactions from literature curations or user submissions part ofthe IMEx consortium

AtPIN httpatpinbioinfoguynetcgi-binatpinpl Incorporates data from IntAct BioGRID TAIR PredictedInteractome for Arabidopsis AtPID

ANAP httpgmddshgmoorgComputational-BiologyANAP

Integrates 11 interaction databases

MIND httpsassociomicsdpbcarnegiescienceeduAssociomicsHomehtml

In total 12 102 high-confidence proteinndashprotein interactions basedon split-uniquitin system in yeast in addition gt3000Arabidopsis membrane proteins in a separate screen areincluded

PPIM httpcomp-sysbioorgppim Contains predictions and information form literaturePRIN httpbiszjueducnprin Predictions based on interlogs in various model organisms where

studies have been carried out

6 | Bolger et al

Protein interaction databases

A similar approach can be followed using the gene product theprotein where one searches for interacting proteins The datarepositories of TAIR (Arabidopsisorg) contain such an approachas well as links to a number of protein interaction resourcessuch as the Arabidopsis interactions viewer [80] IntAct [81]AtPIN [82] and ANAP [83] (Table 2) In addition GabiPD [84] andPhosPhAt [66] databases hold some information on kinases andtheir phosphorylation targets while in the case of membraneproteins the Membrane-based Interactome Database (MIND)[85] can be used The latter is particularly interesting becauseunlike databases containing various types of curated or pre-dicted data MIND rely on experimental results from severalrounds of testing using the split-ubiquitin system The effort forestablishing interaction databases is moving onto other plantspecies as can be seen in the ProteinndashProtein Interaction net-work for Maize (PPIM) [86] and a predicted rice interactome net-work (PRIN) [87] A combination of co-expression andinteraction data can improve the reliability of the functionalprediction [88]

microRNA and target predictions

In addition to protein-coding genes other prominent genomicfeatures that regulate gene expressions include noncodingRNAs This includes a diverse set of plant RNA molecules re-viewed in [89 90] which are transcribed but never translatedinto proteins Next-generation sequencing of these RNA specieswhich is typically performed using specialized RNASeq librariestargeting small RNAs has necessitated the development oftools which can quickly and easily deal with these data sets Anin-depth discussion of all small RNA (sRNA) tools would itselfwarrant a complete review so for the purposes of this articlewe will restrict our discussion to tools relevant to detection andanalysis of microRNA (miRNA) with reference to sRNAtoolbox[91] which offers a selection of user-friendly tools from expres-sion profiling to target gene prediction

miRNAs are a class of RNA that are involved in gene regula-tion Though similar in many respects to small interfering RNAmiRNA can have many target mRNAs and acts as a gene regula-tor (inhibitor) rather than in gene silencing To disambiguatethe two guidelines for the annotation of plant miRNAs havebeen proposed by [92]

One of the main repositories of knowledge for miRNA ismiRBase [93] The most recent release of the database contains28 645 entries representing precursor miRNAs expressing35 828 mature miRNA products in 223 species miRBase add-itionally serves as a registry for newly discovered miRNAs andprovides a naming service for miRNA genes Aside from provid-ing annotations and references for all published miRNA alsquoTargetrsquo pipeline is provided to predict the targets However asthis is only aimed at animal miRNA plant researchers are bestreferred to a recent benchmark [94] comparing many differentplant pipelines Unfortunately the outcome was that for speciesother than A thaliana the accuracy was generally not too highIt was therefore suggested [94] to use a union of predictionsstemming from Targetfinder [95] and psRNATarget [96] to maxi-mize finding potential targets at the cost of identifying manyfalse targets Alternatively highly confident predictions at thecost of losing many true targets were possible by only usingthose predictions made by both psRNATarget and Tapir in hy-brid mode [97]

Automated functional annotation pipelines

Given the dramatic increase in genome and transcriptomesequencing it is not surprising that the demand has grown forfast automated annotation pipelines that quickly providemeaningful biological data from these data sets Many of theearly large-scale genome projects had specific annotationgroups assigned to carry out this task eg TIGR for A thaliana [2]and ITAG for Solanum lycopersicum [98] These frequently fea-tured a combination of computational or automated annota-tions coupled with manual curations Several recent genomeprojects have to a greater extent used automated pipelineswhich may reflect the increasing quality of the tools available Itshould of course be noted that many of the automated pipelinesincorporate data which was manually curated in many of theearlier genomes Taken together this also shows that plant gen-ome analysis benefits from the time gain offered by automatedtools and increases the focus on analyses of more and differentdata sets

In general these tools can be partitioned based on theunderlying ontology used Probably the best known tool to inferGO annotations is BLAST2GO [99] which can also incorporateInterProScan and KEGG data BLAST2GO provides a user-friendly and well-integrated interface featuring locally installedsoftware offering graphical outputs maps etc However someof these features are not available in the free and academic ver-sion but require a license An alternative which is aimed atplant researchers who would like to apply GO terms is providedby the fully integrated TRAPID plant-specific pipeline TRAPIDoffers a Web-based analysis platform and alleviates the need toinstall software [35] As TRAPID also uses gene families it usu-ally should provide good annotation performance

To apply the KO entries (or K numbers) from the KEGG data-base to a gene set the popular online tool KEGG AutomaticAnnotation Server (KAAS) [100] provides a user-friendly inter-face This service relies on BLAST searches and either on unidir-ectional hits or on bidirectional hits together with someheuristics Recent updates to KEGG have introduced theBlastKOALA and GhostKOALA online tools which allow users toexploit data from KEGGrsquos internal annotation tool (KOALA)[101] Both tools target the nonredundant pangenomic data setgenerated from KEGGrsquos genes database with GhostKOALA usingthe GHOSTX search algorithm which is considered more appro-priate for metagenome annotation The result from these toolscan be used as input for the other KEGG modules (eg KEGGpathways)

The MapMan ontology can be inferred using the online toolMercator [102] This allows annotation of both protein and DNAsequences and incorporates BLAST and CDD searches as well asan optional InterProScan annotation In the case of DNA se-quence submission the file is simply analyzed in all six framesfor domain searches and the annotations merged with the ex-pectation that the correct frame will return the best resultUsers can optionally choose a selection of well-annotated plantgenomes to be included in the analysis

Performing annotations using locally installableresources

For the more computation savvy researcher who has access todecent computing resources Trinotate (httpstrinotategithubio) offers a comprehensive annotation suite which extends thepopular Trinity RNASeq assembly pipeline [8] It comprises aBLAST search against the manually assigned SWISSPROT data

set HMMER searches against the PFAM database as well as find-ing signal sequences and predicting subcellular localizationusing TMHMM [53] and SignalP [58] respectively In addition itallows inclusion of the RNAmmer [103] tool which is used toidentify rRNA transcripts and integrates a selection of annota-tion databases (KEGG EggNOG and GO) An alternative is theplant-specific AHRD pipeline (httpsgithubcomgroupschoofAHRD) which provides a consensus annotation based on a setof input gene descriptions obtained by sequence similaritysearches The annotations are the result of scoring the inputgene annotations according to both their frequency and thereputation of the database from which it was derived

The sets of tools described in the previous two sectionsusing controlled terms or annotations are often preferred giventhe ease with which the results can be compared against othersimilarly annotated genomes Prominent examples that usethese tools include the melon genome which used KAAS [104]the peanut ancestor genome which used AHRD [105] and thegenome of the wild tomato which used Mercator [106]

An example of annotation pipelines using rapeseedproteins

Table 3 provides a small survey of the integrated tools by anno-tating 1476 rapeseed proteins using the automated annotationpipelines It is evident that the tools relying on their own

infrastructure generally deliver results quickly In addition theannotation rate ranges from 26 for the KEGG-based toolslikely based on KEGGrsquos stronger focus on metabolism to 56 forGO or MAPMAN-based terms The latter value compares withthe annotation rate of about 51 (from the downloaded refer-ence) when counting any GO term (including lsquocellular compo-nentrsquo) In contrast BLAST2GO reaches a higher annotation rateof 78 but requires 10more time when run on a typical work-station type laptop (eg i7 Quadcore) As noted above such ahigh annotation rate (especially as it is higher than the refer-ence) could result from aggressive (and thus sensitive) standardsettings which potentially should be further tuned when anno-tating plant genomes Nevertheless BLAST2GO might providevaluable leads into less likely functions and as was the case forTAIR most annotations were for the lsquoCellular Componentrsquo do-main as for 70 of the genes a cellular component domain GOterm could be determined It is noteworthy that both plant-specific pipelines (TRAPID and Mercator) reach similar annota-tion rates which is likely because of their specific fine tuning toplant-derived proteins

Integrating the output from several pipelines has beenshown to be beneficial such as in the case of the potato crop[107] In this pipeline the authors used Trinotate BLAST2GOOrthoMCL together with other tools to produce an Ensembleclassifier by counting how many different pipelines a certainGO term was detected Interestingly by using even simple

Table 3 Integrated tools for the functional analysis of plant genomes

Resource Time taken Annotationrate ()

Comments

Reference mdash 51 At least one GO term assigned including cellular componentBlast2GO 8 h 23 min 78 BLAST is performed locally or as WebBLAST via NCBI InterProScan is performed as a

Web service at the European Bioinformatics Institute (EBI)KAAS 10 min (only single-

directional best hit (SBH)was used as a surveysample of sequence)

29 Runs as a Web service no user resources needed

GhostKOALA 28 min 26 Runs as a Web service no user resources neededMercator 5 min 56 Runs as a Web service no user resources neededTRAPID 5 min 56 Runs as a Web service no user resources needed

Note For the analysis the first 1476 proteins from the Brassica proteome version 5 were downloaded from httpwwwgenoscopecnsfrbrassicanapusdata alongside

their GO annotations representing exactly 10 000 lines of text and submitted to the various services where available searches were limited to plant data sets In the

case of Blast2GO WebBLAST was used We have rounded the values as annotations are subjected to updates and time taken will depend on server loads Therefore

these values should be seen as a general orientation

Table 4 Tools and Web sites useful in annotating large protein families

Resource Function Web address

CoGe Compares genomes find synteny httpsgenomevolutionorgPlantTFDB Plant Transcription Factor families httpplanttfdbcbipkueducnPotsdam plntfdb Plant Transcription Factor families httpplntfdbbiouni-potsdamdev30P450 Database P450 protein families httpdrnelsonuthsceduCytochromeP450htmlCAZy Enzymes acting on carbohydrates httpwwwcazyorgAramemnona Plant membrane proteins httparamemnonuni-koelndeMerops Database Peptidases httpmeropssangeracukPLAZA Generalist Plant Family database httpbioinformaticspsbugentbeplazaGreenPhylDB Generalist Plant Family database wwwgreenphylorg

Note aAlso lists a comprehensive set of tools for transmembrane domains subcellular localization and lipid modifications

8 | Bolger et al

Ensembles they increased the concordance with literature an-notations A similar multiple site data retrieval strategy wasused for the wheat database dbWFA [108] which provides adata warehouse strategy and thus allows querying and com-bining different annotations per wheat gene to provide a morecomprehensive picture

Potential pitfalls and difficult gene families

Generally speaking the high-throughput tools mentioned aboveuse methods which can quickly identify the general gene func-tion This frequently relies on identifying the protein familybased on the aforementioned HMMs Although this is often suf-ficient there are many cases where a finer-grained approach isnecessary Plant genomes are renowned for containing largegene families such as transcription factors which can easilynumber in the thousands The Cytochrome P450 family of genesis known to be large in many organisms and it is frequently thetarget of scientific interest in plants given their prominent role

in secondary metabolite biosynthesis which is particularly im-portant to medicinal plants Indeed the two large projectsPhytoMetaSyn [109] and Medicinal Plant Genome Resource [110]are dedicated to the transcriptome analysis of medicinal plants

In cases of large or difficult gene families it is often neces-sary to analyze data in detail by building gene family treeswhich first require careful multiple sequence alignments Thisdismantles large groups of genes into individual genes orsmaller clades allowing distinct functions to be applied An al-ternative andor complementary approach could involve ana-lyzing syntenic relationships using the CoGE resource [111]There are many resources available that are dedicated to anno-tating difficult gene families such as PlantTFDB and PotsdamPlnTFDB for plant transcription factors [112ndash114] the P450 data-base for P450 enzymes [115] CAZy for carbohydrate active en-zymes [116] Merops for peptidases [117] and Aramemnon forplant membrane proteins [55] which are summarized inTable 4 These are complemented by the more generalist plantfamily database PLAZA [15] and GreenPhylDB [118] On a

Figure 2 Flowchart for the annotation of plant genomestranscriptomes

broader level using detailed phylogenetic information in thegenomic era brings in phylogenomics tools whose use is re-viewed in [119]

Recipe

Based on the discussion above and focusing on the use of onlineresources one could annotate a plant genome almost automat-ically following the steps below (Figure 2)

(i) Generation of protein sequences The first question oneshould ask is whether protein sequences are available (which istypically the case in a genome project) or not (which is typicallythe case in a transcriptome project) If protein-coding sequencesare not available these should be generated from transcript se-quences using eg FrameDP or the AUGUSTUSBREAKER1 pipe-line for genome assemblies Other tools to perform this task arediscussed in the section lsquoFinding coding regions in transcrip-tome assembliesrsquo This would provide a common input for thesubsequent annotation regardless of the starting approach

(iia) Annotation One would then submit the resulting pro-tein sequences as one file to the following three online re-sources As all these services make use of their own high-performance computing pipelines and are free for academicusers they can be run in parallel

bull KAAS can be used to infer KEGG termsbull The TRAPID pipeline can be used for GO termsbull Mercator can be used for MapMan terms

At this stage one could also use BLAST2GO to infer GO anno-tations however this is only possible if one either has a licenseor is an academic user It should be considered however thatBLAST2GO has a much longer run-time which could impedesubsequent genome annotation analysis tasks Thus a decisionis needed if the additional time is worth the extra annotationswhich are potentially not provided by annotation alternativeslike TRAPID

(iib) (optional) Validation of annotations In cases wherelocal computational resources are available one should add-itionally run EggNOG scans on the side to further validate andcompare the derived functional annotations To test if this pro-cedure is feasible using the available equipment we recom-mend running a truncated sample of eg 100ndash1000 proteinsequences first

The above two steps provide a fast solution to arrive at aplethora of terms which can be easily combined using evensimple tools like MS Excel where one could add the differentontologies into separate rows for inspection Even though thedifferent ontologies cannot be directly compared with eachother they help in understanding the genome in their ownright

This would already provide a good working annotation formany research topics and could be used to answer questionssuch as Are certain processes occurring or not Or do we seemore genes in secondary metabolism than in related plants

(iii) Additional information One can annotate transmem-brane domains using TMHMM andor TOPCons (the online ver-sions of both were relatively easy to use in our hands but theonline TMHMM tool was significantly faster) For TMHMM onemight have to split the protein sequence file obtained from Step(i) into several batches In the simplest case one could do thisby hand in a text editor such as Notepadthornthorn on Windows orTextWrangler on MacOS

(iv) Similarly as for transmembrane domains (ie after split-ting of the file from Step (i)) subcellular localization can be pre-dicted using the online tools TargetP andor AtSubP

(v) Finally one could use the PhosPhat and PHOSFER onlinetools for a prediction of phosphorylation sites

After the annotation process is complete it is advisable tolook at a selection of these annotations to verify whether theyare correct A good choice for a more experiment-oriented re-searcher would be to focus on the genes or gene families whichone works with in the laboratory This verifies that the expectedannotations are present and that no wrong annotations hadbeen added Alternatively or in addition manually comparinggenes described in the literature with their automaticallyderived annotations are highly recommended

Key Points

bull Annotating plant genomes should (also) rely on ontol-ogies and there are several complementary resourcesavailable

bull Generally large-scale annotation relies on homologytransfer and it is complemented by finding domainsand protein families

bull Annotation can be further complemented by add-itional predictions such as transmembrane domainssubcellular localization and phosphorylation sites

bull Plant genome annotation can be performed automat-ically using the plant-specific tool Mercator as well asthe generalists TRAPID (which features a special plantmodule) KAASBlastKOALA and Blast2GO without anysignificant computational resources

bull It is important to quality check the annotation and toremember that all annotations should be treated ashypotheses

Funding

The German Ministry for Education and Research (referencenumber 0315961 in partial) the Ministry of InnovationScience and Research of North-Rhine Westphalia within theframework of the North-Rhine Westphalia StrategieprojektBioEconomy Science Center (grant number 313323ndash400ndash00213) and the FRS-FNRS Fonds de la RechercheScientifique as a Charge de recherches (grant number10841946)

References1 Bolger ME Weisshaar B Scholz U et al Plant genome

sequencing - applications for crop improvement Curr OpinBiotechnol 20142631ndash7

2 Arabidopsis Genome I Analysis of the genome sequence ofthe flowering plant Arabidopsis thaliana Nature2000408796ndash815

3 Stanke M Diekhans M Baertsch R et al Using native andsyntenically mapped cDNA alignments to improve de novogene finding Bioinformatics 200824637ndash44

4 Sleator RD An overview of the current status of eukaryotegene prediction strategies Gene 20104611ndash4

5 Campbell MS Law M Holt C et al MAKER-P a tool kit for therapid creation management and quality control of plantgenome annotations Plant Physiol 2014164513ndash24

10 | Bolger et al

6 Hoff KJ Lange S Lomsadze A et al BRAKER1 unsupervisedRNA-seq-based genome annotation with GeneMark-ET andAUGUSTUS Bioinformatics 201632767ndash9

7 Hirsch CN Buell CR Tapping the promise of genomics inspecies with complex nonmodel genomes Annu Rev PlantBiol 20136489ndash110

8 Grabherr MG Haas BJ Yassour M et al Full-length transcrip-tome assembly from RNA-Seq data without a reference gen-ome Nat Biotechnol 201129644ndash52

9 Gongora-Castillo E Buell CR Bioinformatics challenges in denovo transcriptome assembly using short read sequences inthe absence of a reference genome sequence Nat Prod Rep201330490ndash500

10 Schliesky S Gowik U Weber AP et al RNA-seq assemblymdashAre we there yet Front Plant Sci 20123220

11 Honaas LA Wafula EK Wickett NJ et al Selecting superior denovo transcriptome assemblies lessons learned by leverag-ing the best plant genome PLoS One 201611e0146062

12 Berardini TZ Reiser L Li D et al The Arabidopsis informa-tion resource making and mining the ldquogold standardrdquo anno-tated reference plant genome Genesis 201553474ndash85

13 Krishnakumar V Hanlon MR Contrino S et al Araport theArabidopsis information portal Nucleic Acids Res201543D1003ndash9

14 Lamesch P Berardini TZ Li D et al The ArabidopsisInformation Resource (TAIR) improved gene annotationand new tools Nucleic Acids Res 201240D1202ndash10

15 Proost S Van Bel M Vaneechoutte D et al PLAZA 30 an ac-cess point for plant comparative genomics Nucleic Acids Res201543D974ndash81

16 Hoehndorf R Schofield PN Gkoutos GV The role of ontolo-gies in biological and biomedical research a functional per-spective Brief Bioinform 2015161069ndash80

17 Carbon S Ireland A Mungall CJ et al AmiGO online accessto ontology and annotation data Bioinformatics200925288ndash9

18 Kanehisa M Sato Y Kawashima M et al KEGG as a referenceresource for gene and protein annotation Nucleic Acids Res201644D457ndash62

19 Chae L Kim T Nilo-Poyanco R et al Genomic signatures ofspecialized metabolism in plants Science 2014344510ndash3

20 Nikoloski Z Perez-Storey R Sweetlove LJ Inference and pre-diction of metabolic network fluxes Plant Physiol20151691443ndash55

21 Tello-Ruiz MK Stein J Wei S et al Gramene 2016 compara-tive plant genomics and pathway resources Nucleic Acids Res201644D1133ndash40

22 Nomenclature committee of the international union of bio-chemistry and molecular biology (NC-IUBMB) EnzymeSupplement 5 (1999) Eur J Biochem 1999264610ndash50

23 Thimm O Blasing O Gibon Y et al MAPMAN a user-driventool to display genomics data sets onto diagrams of meta-bolic pathways and other biological processes Plant J200437914ndash39

24 Bargsten JW Severing EI Nap J-P et al Biological process an-notation of proteins across the plant kingdom Curr Plant Biol2014173ndash82

25 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

26 Eddy SR A new generation of homology search tools basedon probabilistic inference Genome Inform 200923205ndash11

27 Goodstein DM Shu S Howson R et al Phytozome a com-parative platform for green plant genomics Nucleic Acids Res201240D1178ndash86

28 Bolser D Staines DM Pritchard E et al Ensembl plants inte-grating tools for visualizing mining and analyzing plantgenomics data Methods Mol Biol 20161374115ndash40

29 UniProt Consortium UniProt a hub for protein informationNucleic Acids Res 201543D204ndash12

30 Iseli C Jongeneel CV Bucher P ESTScan a program for de-tecting evaluating and reconstructing potential coding re-gions in EST sequences Proc Int Conf Intell Syst Mol Biol1999138ndash48

31 Lottaz C Iseli C Jongeneel CV et al Modeling sequencingerrors by combining hidden Markov models Bioinformatics200319 (Suppl 2)ii103ndash12

32 Haas BJ Papanicolaou A Yassour M et al De novo transcriptsequence reconstruction from RNA-seq using the Trinityplatform for reference generation and analysis Nat Protoc201381494ndash512

33 Gouzy J Carrere S Schiex T FrameDP sensitive peptide de-tection on noisy matured sequences Bioinformatics200925670ndash1

34 Tang S Lomsadze A Borodovsky M Identification of proteincoding regions in RNA transcripts Nucleic Acids Res201543e78

35 Van Bel M Proost S Van Neste C et al TRAPID an efficientonline tool for the functional and comparative analysis of denovo RNA-Seq transcriptomes Genome Biol 201314R134

36 May P Wienkoop S Kempa S et al Metabolomics- andproteomics-assisted genome annotation and analysis of thedraft metabolic network of Chlamydomonas reinhardtiiGenetics 2008179157ndash66

37 Silmon de Monerri NC Weiss LM Integration of RNA-seq andproteomics data with genomics for improved genome anno-tation in Apicomplexan parasites Proteomics 2015152557ndash9

38 Finn RD Coggill P Eberhardt RY et al The Pfam protein fam-ilies database towards a more sustainable future NucleicAcids Res 201644D279ndash85

39 Haft DH Selengut JD Richter RA et al TIGRFAMs and gen-ome properties in 2013 Nucleic Acids Res 201341D387ndash95

40 Mi H Poudel S Muruganujan A et al PANTHER version 10expanded protein families and functions and analysis toolsNucleic Acids Res 201644D336ndash42

41 Letunic I Doerks T Bork P SMART recent updates new de-velopments and status in 2015 Nucleic Acids Res201543D257ndash60

42 Huerta-Cepas J Szklarczyk D Forslund K et al eggNOG 45 ahierarchical orthology framework with improved functionalannotations for eukaryotic prokaryotic and viral sequencesNucleic Acids Res 201644D286ndash93

43 Jones P Binns D Chang HY et al InterProScan 5 genome-scaleprotein function classification Bioinformatics 2014301236ndash40

44 Mitchell A Chang HY Daugherty L et al The InterPro pro-tein families database the classification resource after 15years Nucleic Acids Res 201543D213ndash21

45 Marchler-Bauer A Derbyshire MK Gonzales NR et al CDDNCBIrsquos conserved domain database Nucleic Acids Res201543D222ndash6

46 Marchler-Bauer A Bryant SH CD-Search protein domainannotations on the fly Nucleic Acids Res 200432W327ndash31

47 Sonnhammer EL Ostlund G InParanoid 8 orthology ana-lysis between 273 proteomes mostly eukaryotic NucleicAcids Res 201543D234ndash9

48 Emms DM Kelly S OrthoFinder solving fundamental biasesin whole genome comparisons dramatically improvesorthogroup inference accuracy Genome Biol 201516157

49 Li L Stoeckert CJ Jr Roos DS OrthoMCL identification ofortholog groups for eukaryotic genomes Genome Res2003132178ndash89

50 Altenhoff AM Boeckmann B Capella-Gutierrez S et alStandardized benchmarking in the quest for orthologs NatMethods 201613425ndash30

51 Schnable JC Freeling M Lyons E Genome-wide analysis ofsyntenic gene deletion in the grasses Genome Biol Evol20124265ndash77

52 Tang H Bomhoff MD Briones E et al SynFind compilingsyntenic regions across any set of genomes on demandGenome Biol Evol 201573286ndash98

53 Krogh A Larsson B von Heijne G et al Predictingtransmembrane protein topology with a hidden Markovmodel application to complete genomes J Mol Biol2001305567ndash80

54 Tsirigos KD Peters C Shu N et al The TOPCONS web serverfor consensus prediction of membrane protein topology andsignal peptides Nucleic Acids Res 201543W401ndash7

55 Schwacke R Schneider A van der Graaff E et alARAMEMNON a novel database for Arabidopsis integralmembrane proteins Plant Physiol 200313116ndash26

56 Gromiha MM Ou YY Bioinformatics approaches for func-tional annotation of membrane proteins Brief Bioinform201415155ndash68

57 Emanuelsson O Brunak S von Heijne G et al Locating pro-teins in the cell using TargetP SignalP and related tools NatProtoc 20072953ndash71

58 Petersen TN Brunak S von Heijne G et al SignalP 40 dis-criminating signal peptides from transmembrane regionsNat Methods 20118785ndash6

59 Meinken J Min J Computational prediction of protein sub-cellular locations in eukaryotes an experience reportComput Mol Biol 201221ndash7

60 Kaundal R Saini R Zhao PX Combining machine learning andhomology-based approaches to accurately predict subcellularlocalization in Arabidopsis Plant Physiol 201015436ndash54

61 Chou KC Shen HB Plant-mPLoc a top-down strategy to aug-ment the power for predicting plant protein subcellular lo-calization PLoS One 20105e11335

62 Small I Peeters N Legeai F et al Predotar a tool for rapidlyscreening proteomes for N-terminal targeting sequencesProteomics 200441581ndash90

63 Ryngajllo M Childs L Lohse M et al SLocX predicting sub-cellular localization of arabidopsis proteins leveraging geneexpression data Front Plant Sci 2011243

64 Tanz SK Castleden I Hooper CM et al SUBA3 a database forintegrating experimentation and prediction to define thesubcellular location of proteins in Arabidopsis Nucleic AcidsRes 201341D1185ndash91

65 Trost B Kusalik A Computational phosphorylation site pre-diction in plants using random forests and organism-specific instance weights Bioinformatics 201329686ndash94

66 Durek P Schmidt R Heazlewood JL et al PhosPhAt theArabidopsis thaliana phosphorylation site database An up-date Nucleic Acids Res 201038D828ndash34

67 Lee TY Bretana NA Lu CT PlantPhos using maximal de-pendence decomposition to identify plant phosphorylationsites with substrate site specificity BMC Bioinformatics201112261

68 Gao J Xu D The Musite open-source framework forphosphorylation-site prediction BMC Bioinformatics 201011(Suppl 12)S9

69 Yao Q Schulze WX Xu D Phosphorylation site prediction inplants Methods Mol Biol 20151306217ndash28

70 Usadel B Obayashi T Mutwil M et al Co-expression toolsfor plant biology opportunities for hypothesis generationand caveats Plant Cell Environ 2009321633ndash51

71 Tohge T Fernie AR Co-expression and co-responses withinand beyond transcription Front Plant Sci 20123248

72 Mutwil M Usadel B Schutte M et al Assembly of an inter-active correlation network for the Arabidopsis genomeusing a novel heuristic clustering algorithm Plant Physiol201015229ndash43

73 Aoki Y Okamura Y Tadaka S et al ATTED-II in 2016 a plantcoexpression database towards lineage-specific coexpres-sion Plant Cell Physiol 201657e5

74 Persson S Wei H Milne J et al Identification of genesrequired for cellulose synthesis by regression analysis ofpublic microarray data sets Proc Natl Acad Sci USA20051028633ndash8

75 Voiniciuc C Gunl M Schmidt MH et al Highly branchedXylan Made by irregular XYLEM14 and MUCILAGE-RELATED21 links mucilage to Arabidopsis seeds Plant Physiol20151692481ndash95

76 Voiniciuc C Schmidt MH Berger A et al MUCILAGE-RELATED10 produces galactoglucomannan that maintainspectin and cellulose architecture in Arabidopsis seed muci-lage Plant Physiol 2015169403ndash20

77 Tohge T Fernie AR Annotation of plant gene function viacombined genomics metabolomics and informatics J VisExp 2012e3487

78 Tohge T Fernie AR Combining genetic diversity informat-ics and metabolomics to facilitate annotation of plant genefunction Nat Protoc 201051210ndash27

79 Gillis J Pavlidis P ldquoGuilt by associationrdquo is the exception ra-ther than the rule in gene networks PLoS Comput Biol20128e1002444

80 Geisler-Lee J OrsquoToole N Ammar R et al A predicted interac-tome for Arabidopsis Plant Physiol 2007145317ndash29

81 Orchard S Ammari M Aranda B et al The MIntAct projectndashIntAct as a common curation platform for 11 molecularinteraction databases Nucleic Acids Res 201442D358ndash63

82 Brandao MM Dantas LL Silva-Filho MC AtPIN Arabidopsisthaliana protein interaction network BMC Bioinformatics200910454

83 Wang C Marshall A Zhang D et al ANAP an integratedknowledge base for Arabidopsis protein interaction networkanalysis Plant Physiol 20121581523ndash33

84 Usadel B Schwacke R Nagel A et al GabiPDmdashthe GABI pri-mary database integrates plant proteomic data with gene-centric information Front Plant Sci 20123154

85 Lalonde S Sero A Pratelli R et al A membrane proteinsig-naling protein interaction network for Arabidopsis versionAMPv2 Front Physiol 2010124

86 Zhu G Wu A Xu XJ et al PPIM a protein-protein interactiondatabase for Maize Plant Physiol 2016170618ndash26

87 Gu H Zhu P Jiao Y et al PRIN a predicted rice interactomenetwork BMC Bioinformatics 201112161

88 Piya S Shrestha SK Binder B et al Protein-protein inter-action and gene co-expression maps of ARFs and AuxIAAsin Arabidopsis Front Plant Sci 20145744

89 Borges F Martienssen RA The expanding world of smallRNAs in plants Nat Rev Mol Cell Biol 201516727ndash41

90 Axtell MJ Classification and comparison of small RNAs fromplants Annu Rev Plant Biol 201364137ndash59

12 | Bolger et al

91 Rueda A Barturen G Lebron R et al sRNAtoolbox an inte-grated collection of small RNA research tools Nucleic AcidsRes 201543W467ndash73

92 Meyers BC Axtell MJ Bartel B et al Criteria for annotation ofplant microRNAs Plant Cell 2008203186ndash90

93 Kozomara A Griffiths-Jones S miRBase annotating highconfidence microRNAs using deep sequencing data NucleicAcids Res 201442D68ndash73

94 Srivastava PK Moturu TR Pandey P et al A comparison ofperformance of plant miRNA target prediction tools and thecharacterization of features for genome-wide target predic-tion BMC Genomics 201415348

95 Fahlgren N Howell MD Kasschau KD et al High-throughputsequencing of Arabidopsis microRNAs evidence for fre-quent birth and death of MIRNA genes PLoS One 20072e219

96 Dai X Zhao PX psRNATarget a plant small RNA target ana-lysis server Nucleic Acids Res 201139W155ndash9

97 Bonnet E He Y Billiau K et al TAPIR a web server for theprediction of plant microRNA targets including targetmimics Bioinformatics 2010261566ndash8

98 Tomato GC The tomato genome sequence provides insightsinto fleshy fruit evolution Nature 2012485635ndash41

99 Conesa A Gotz S Blast2GO a comprehensive suite for func-tional analysis in plant genomics Int J Plant Genomics20082008619832

100 Moriya Y Itoh M Okuda S et al KAAS an automatic genomeannotation and pathway reconstruction server Nucleic AcidsRes 200735W182ndash5

101 Kanehisa M Sato Y Morishima K BlastKOALA andGhostKOALA KEGG tools for functional characterization ofgenome and metagenome sequences J Mol Biol2016428726ndash31

102 Lohse M Nagel A Herter T et al Mercator a fast and simpleweb server for genome scale functional annotation of plantsequence data Plant Cell Environ 2014371250ndash8

103 Lagesen K Hallin P Rodland EA et al RNAmmer consistentand rapid annotation of ribosomal RNA genes Nucleic AcidsRes 2007353100ndash8

104 Garcia-Mas J Benjak A Sanseverino W et al The genome ofmelon (Cucumis melo L) Proc Natl Acad Sci USA201210911872ndash7

105 Bertioli DJ Cannon SB Froenicke L et al The genome se-quences of Arachis duranensis and Arachis ipaensis the dip-loid ancestors of cultivated peanut Nat Genet201648438ndash46

106 Bolger A Scossa F Bolger ME et al The genome of thestress-tolerant wild tomato species Solanum pennellii NatGenet 2014461034ndash8

107 Amar D Frades I Danek A et al Evaluation and integrationof functional annotation pipelines for newly sequenced or-ganisms the potato genome as a test case BMC Plant Biol201414329

108 Vincent J Dai Z Ravel C et al dbWFA a web-based databasefor functional annotation of Triticum aestivum transcriptsDatabase (Oxford) 20132013bat014

109 Xiao M Zhang Y Chen X et al Transcriptome analysis basedon next-generation sequencing of non-model plants pro-ducing specialized metabolites of biotechnological interestJ Biotechnol 2013166122ndash34

110 Gongora-Castillo E Fedewa G Yeo Y et al Genomicapproaches for interrogating the biochemistry of medicinalplant species Methods Enzymol 2012517139ndash59

111 Lyons E Freeling M How to usefully compare homologousplant genes and chromosomes as DNA sequences Plant J200853661ndash73

112 Jin J Zhang H Kong L et al PlantTFDB 30 a portal for thefunctional and evolutionary study of plant transcription fac-tors Nucleic Acids Res 201442D1182ndash7

113 Perez-Rodriguez P Riano-Pachon DM Correa LG et alPlnTFDB updated content and new features of the planttranscription factor database Nucleic Acids Res201038D822ndash7

114 Palaniswamy SK James S Sun H et al AGRIS and AtRegNeta platform to link cis-regulatory elements and transcriptionfactors into regulatory networks Plant Physiol2006140818ndash29

115 Nelson DR The cytochrome p450 homepage Hum Genomics2009459ndash65

116 Lombard V Golaconda Ramulu H Drula E et al Thecarbohydrate-active enzymes database (CAZy) in 2013Nucleic Acids Res 201442D490ndash5

117 Rawlings ND Barrett AJ Finn R Twenty years of theMEROPS database of proteolytic enzymes their substratesand inhibitors Nucleic Acids Res 201644D343ndash50

118 Rouard M Guignon V Aluome C et al GreenPhylDB v20comparative and functional genomics in plants NucleicAcids Res 201139D1095ndash102

119 Kristensen DM Wolf YI Mushegian AR et al Computationalmethods for Gene Orthology inference Brief Bioinform201112379ndash91

bbw135-TF1

bbw135-TF2

and consisting uniquely of gene rich data Popular transcrip-tome assembly tools such as TRINITY [8] require significant op-timization to produce an assembly of reasonable quality Forrecipes and cookbooks in the plant field one can refer to [9ndash11]

Once the gene structures have been detected the necessarynext step is to ascribe biological function to the genes in a pro-cess known as functional annotation Surprisingly performingthis task to a degree of accuracy remains challenging despitethe extensive accrual of knowledge about gene function inmodel and crop species Indeed there is still a large percentageof genes many of which can be found across multiple specieswhose function has not been ascertained

Within the plant community A thaliana remains the bestannotated plant largely because of the tremendous effort of TheArabidopsis Information Resource (TAIR) which integratescommunity-based curations together with annotations from lit-erature evidence Over 2800 experimentally supported annota-tions have been included within the past 2 years alone [12] Thiswealth of data has been adopted and further augmented byAraPort [13] an open-source resource which encourages thecommunity to contribute not only data modules but visualiza-tion tools and apps Despite these extensive resourcespublished data [14] indicate that still only about 77 of the pro-tein-coding sequences could be assigned any kind of structuredannotation This figure is in agreement with data from thePLAZA database an online platform that has processed the an-notations from several plant species into a uniform format [15]

Controlled terms and vocabularies for plantfunctional annotations

Homology-based functional annotation is the transfer of exist-ing knowledge about a gene sequence to another gene sequencewithin the same species or to another species This process es-sentially depends on the existing knowledge about a gene func-tion being transferable to genes of a similar sequence andassumes that this similarity reflects functional homologyAlthough an experimentalist working with a non-model speciesmay likely be content with an annotation such as lsquoquite similarto an Arabidopsis thaliana malic enzymersquo this annotation bearsseveral difficulties for a sustainable annotation frameworkThis also hampers structured analysis of genome-wide data toanswer questions like lsquohow many genes are involved in photo-synthesis or glycolysisrsquo This problem can largely be alleviatedby using controlled vocabularies and functional ontologies [16]to provide a consistent description of gene products across dif-ferent species

The Gene Ontology ontology

The most widely used functional annotation is lsquoGene Ontologyrsquo(GO) that provides defined lsquoGO termsrsquo to enable gene productsto be described by three separate domains lsquoBiological Processrsquowhich describes the gene in terms of a recognized series ofevents or molecular functions lsquoCellular Componentrsquo describingthe location of a protein (or rather biomolecule) at a cellularandor macromolecular level and lsquoMolecular Functionrsquo describ-ing the jobs or abilities that a gene product has on the molecularlevel Besides GO terms each GO annotation contains an lsquoevi-dence codersquo which provides information on how a GO term wasapplied to a gene Evidence codes indicate whether the annota-tion is based on experimental evidence computational analysisauthor statements or curatorial statements all of which aremanually curated GO annotations also contain evidence codes

which are used to indicate assignment by automaticcomputa-tional methods This has the advantage that annotations basedon experimental data can be treated with much higherconfidence than automatic annotations of related proteinsIn addition by qualifying where such an annotation came fromit is easier to check the respective annotation In this respectcurated GO resources such as the one for A thaliana representan invaluable resource

The GO ontology is structured as a directed acyclical graphmaking it possible to infer more general terms from a specificterm This additionally allows grouping data eg our malicenzyme might be annotated with the GO term lsquoGO0009763rsquolsquoNAD-malic enzyme C4 photosynthesisrsquo from which one couldimmediately deduce using eg the Amigo browser [17] that theterms lsquoGO0015979rsquo lsquophotosynthesisrsquo and lsquoGO0015977rsquo lsquocarbonfixationrsquo also apply

The Kyoto Encyclopedia of Genes and Genomes ontology

Another widely used resource is the Kyoto Encyclopedia ofGenes and Genomes (KEGG httpwwwkeggjp) This featuresa number of databases that aim to link genomic- and molecu-lar-level information to higher-level functions of the cell organ-ism and the ecosystem Annotation with KEGG is based onassociating molecular function with orthologous groups whichare defined based on clustering of genes from completed gen-omes (currently gt4000 genomes) using the KEGGrsquos internallsquoKEGG Orthology and Links Annotationrsquo (KOALA) program Theresulting information is stored in the lsquoKEGG Orthologyrsquo (KO)database and assignment of KO entry identifier (also called Knumbers) provides the gene annotation KEGG aims to includereference to primary literature for each KO entry (76 of around19 000 KO entries contains this as of September 2015) [18]

CYC and other metabolic resources

In the case of enzymatic reactions there is also the CYC net-work whose plant section is under the Plant Metabolic network(httpwwwplantcycorg) umbrella [19] This is mainly used todescribe enzymatic functions and while it enables one to buildreaction networks [20] it does not cover additional functionalterms The plant Reactome is a database of plant metabolic andregulatory pathways which have been curated for the referencespecies rice and applied to 58 other plant species [21] Finallythe Enzyme Commission numbers [22] (httpwwwchemqmulacukiubmbenzyme) describe reactions and classify enzymeswhich are also referenced by KEGG CYC and ReactomeAlthough CYC uses citations to the primary literature exten-sively the enzymes within CYC are not formally linked to anno-tations via evidence codes

The MapMan BIN ontology

One additional ontology resource that is specifically plantfocused is the MapMan BIN ontology This was originallydevised to visualize omics data on plant pathways [23] but hasgrown since and currently comprises around 2000 ontologicalterms The MapMan ontology is modeled in a hierarchical treestructure with higher-level categories based on biological pro-cess and leaf categories containing detailed function The struc-ture was manually defined by experts in the respective fieldsand changes are applied periodically based on primary litera-ture Although MapMan endeavors to assign lsquoevidencersquo to theBINs (httpmapmangabipdorgwebguestmapcave) these arecurrently updated as new releases are published As this

2 | Bolger et al

4 | Bolger et al

Adding information

Uploads lt2 MB

6 | Bolger et al

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

4 | Bolger et al

Adding information

Uploads lt2 MB

6 | Bolger et al

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

4 | Bolger et al

Adding information

Uploads lt2 MB

6 | Bolger et al

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

Adding information

Uploads lt2 MB

6 | Bolger et al

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

Uploads lt2 MB

6 | Bolger et al

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

Comments

8 | Bolger et al

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

Recipe

Key Points

Funding

10 | Bolger et al

12 | Bolger et al

bbw135-TF1

bbw135-TF2

12 | Bolger et al

bbw135-TF1

bbw135-TF2

12 | Bolger et al

bbw135-TF1

bbw135-TF2

bbw135-TF1

bbw135-TF2

Documents

Plant genome and transcriptome annotations: from ... · Popular transcrip-tome assembly tools such as TRINITY [8] require significant op-timization to produce an assembly of reasonable