Data as research output; Data as part of the scholarly record
Todd Vision University of North Carolina at Chapel Hill Dryad Digital Repository
SciELO15 · 24 October 2013 · São Paulo
CC-BY-NC-SA nic221 http://www.flickr.com/photos/nic221/391536867/
Source: IFEX http://www.ifex.org/united_states/2013/09/05/cipa_libraries/
2011 BURLEIGH ET AL.—INFERRING THE PLANT TREE OF LIFE FROM GENE TREES 121
TABLE 1. Summary of supertree bootstrap support from the GTP analysis

                   Number     Clades with bootstrap support (%)
                   of taxa    100     ≥90     ≥70     ≥50
136-Taxon Cons.    136        9.8     30.8    56.4    74.4
Reduced Cons.      82         50.6    70.9    89.9    98.7
82-Taxon Cons.     82         53.1    72.2    84.8    96.2

Notes: This displays the percentage of total clades at or above a given level of bootstrap support for 1) the majority rule consensus of all bootstrap trees from the NJ/PP analysis of 136 taxa (136-Taxon Cons.), 2) the reduced consensus of all bootstrap trees for the 82 taxa present in at least 1300 of the gene trees (Reduced Cons.), and 3) the majority rule consensus of all bootstrap trees from the NJ/PP analysis of the same 82 taxa as above (82-Taxon Cons.).
plants (99% support), gymnosperms (100% support), angiosperms (99% support), eudicots (99% support), core eudicots (99% support), and asterids (100% support; Fig. 3). Within gymnosperms, Gnetales were sister to the conifers (100% support; Fig. 3). Amborella was sister to all other angiosperms, and Nuphar (Nymphaeales) was sister to all angiosperms except Amborella (Fig. 3). Magnoliids were sister to a monocot + eudicot clade (Fig. 3). Within monocots, the Poaceae (grass family) had 100% support, and within the grasses, the Panicoideae clade had 100% bootstrap support (Fig. 3). In the core eudicot clade, the Caryophyllales (100% support) were sister to the rosids (99% support) and the asterids (100% support) (Fig. 3).
There were several differences in the species tree obtained using ML gene trees versus NJ/PP gene trees. For example, the relationships among eurosid lineages differed slightly; however, in both analyses, Malpighiales (eurosid I) were nested in a clade with eurosid II taxa (Figs. 1 and 3). The BEP-clade (Bambusoideae, Ehrhartoideae, and Pooideae) was not supported in the analysis using NJ/PP gene trees, but it was when using ML gene trees (Fig. 3). Acorus americanus was not placed with other monocots in the NJ/PP analysis, but it was in a monocot clade when using ML gene trees (Fig. 3).
FIGURE 2. Average quartet similarity for each taxon among bootstrap trees. Each point in the graph represents a single taxon. The x-axis shows the number of gene family trees that have data from the taxon. The y-axis shows the average percentage frequency of quartets (four taxon statements) containing the taxon that are identical between two bootstrap trees. The shaded area in the graph contains all taxa that are present in less than 1300 gene trees.
DISCUSSION
Frequent gene and whole-genome duplications have, in the past, limited the use of nuclear genes for deep-level phylogenetic analyses in plants and other clades with highly duplicated genomes. GTP provides a way to exploit the phylogenetic information inherent not only in the relationships among orthologous genes but also the rare gene duplications that produce paralogous gene family members. Rather than treating gene tree discordance as a nuisance, it seeks the species tree that provides the best reconciliation among the many discordant gene trees.
In this study, we used GTP to find species trees that minimize the total number of duplications across a collection of nearly 18,896 plant gene trees. The sequence sampling includes extensive collections of existing EST data that have rarely before been used for plant phylogenetics (but see de la Torre et al. 2006; Sanderson and McMahon 2007; de la Torre-Barcena et al. 2009). Thus, this study provides a new nuclear genomic perspective on the plant tree of life.
Overall, the phylogenetic relationships inferred from gene duplications are largely consistent with previous large-scale molecular studies of plant phylogeny (e.g., Soltis et al. 2000; Hilu et al. 2003; Jansen et al. 2007). Yet the GTP analysis also provides support for some relationships that are unresolved or conflicting in previous analyses. For example, the results support the placement of magnoliids sister to monocots + eudicots, making eudicots (possibly with Ceratophyllum, which was not included in this study) sister to monocots (Figs. 1 and 3). The relationships among these major clades are unclear from analyses using few genes (e.g., Soltis et al. 2000, 2007; Hilu et al. 2003), but our result is consistent with recent analyses using 81 plastid genes (Jansen et al. 2007). The placement of Malpighiales within a eurosid II clade (Figs. 1 and 3) generally conflicts with previous large-scale angiosperm analyses (e.g., Soltis et al. 2000; Hilu et al. 2003; Jansen et al. 2007). Given the novelty of the result, it should be interpreted with great caution.
Our results indicate that data from many gene trees may be required to produce a well-supported phylogeny using GTP (Table 1; Figs. 2 and 3), suggesting that GTP may not use data as efficiently as more traditional phylogenetic analyses of concatenated multigene data sets. For example, in plants, recent analyses of up to 83 plastid genes have apparently resolved enigmatic relationships in the backbone angiosperm phylogeny, whereas our analyses appear to require data from >1000 genes (Jansen et al. 2007; Moore et al. 2007, 2010). Like
or sum

\[ \mathrm{score}_{\mathrm{int3},i} = \sum_{j=0}^{n} z_{ij} \]

of the transformed scores for gene i.

The fourth method, referred to as int4, differs from the other three by considering both the score of a gene within a data source as well as the number of genes returned for that data source. First, a transformed score s_ij is obtained.

\[ s_{ij} = \frac{r_{ij}}{\sum_{i=0}^{n} r_{ij}} \]

The transformed gene scores are then summed together to provide a final score for each gene.

\[ \mathrm{score}_{\mathrm{int4},i} = \sum_{j=1}^{J} \frac{s_{ij}\, g_j}{G} \]

where g_j is the number of genes returned for source j and

\[ G = \sum_{j=1}^{J} g_j \]
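The int4 combination described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CAESAR implementation (which the excerpt says was written in Perl and Java): the function name `int4_scores`, the nested-dictionary input layout, and the reading of g_j as "the number of genes that have a score in source j" are all hypothetical choices made for the example.

```python
def int4_scores(raw_scores):
    """Combine per-source gene scores with the int4 rule.

    raw_scores maps a source name to a dict {gene: r_ij}, containing only
    the genes returned by that source (an assumed input layout).
    Returns {gene: final score}.
    """
    # G: total number of genes returned, summed over all sources.
    total_returned = sum(len(genes) for genes in raw_scores.values())
    final = {}
    for genes in raw_scores.values():
        g_j = len(genes)                # genes returned by this source
        norm = sum(genes.values())      # sum of r_ij over genes in this source
        if g_j == 0 or norm == 0:
            continue                    # nothing to contribute from this source
        for gene, r_ij in genes.items():
            s_ij = r_ij / norm          # transformed score within the source
            final[gene] = final.get(gene, 0.0) + s_ij * g_j / total_returned
    return final


# Toy input: two hypothetical sources, two hypothetical genes.
example = {
    "source_A": {"gene1": 1.0, "gene2": 3.0},
    "source_B": {"gene1": 2.0},
}
scores = int4_scores(example)
```

Because each source's transformed scores sum to 1 and the source weights g_j / G also sum to 1, the final scores in this sketch sum to 1 across genes, which makes a convenient sanity check.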
2.4 Implementation
The CAESAR algorithms were written using Perl version 5.8.1 and Java version 1.4.2. The vector space similarity searches were performed using a modified version of the Perl module Search::VectorSpace by Maciej Ceglowski (http://www.perl.com/pub/a/2003/02/19/engine.html). Databases and ontology schemas were downloaded and parsed into XML under a custom XML schema. Intermediate text and data-mining results were also stored as XML under the same schema.
2.5 Selection of the tests for complex traits
To assess the ability of CAESAR to choose valid candidates, 18 test genes were selected from recently published reports providing strong evidence of statistical association with known complex human disorders. The test genes included CTLA4 (Ueda et al., 2003), PTPN22 (Bottini et al., 2004), PTPN22 (Begovich et al., 2004), SUMO4 (Guo et al., 2004), FCRL3 (Kochi et al., 2005), ENTH (Pimm et al., 2005), EN2 (Gharani et al., 2004), TCF7L2 (Grant et al., 2006), CFH (Klein et al., 2005), LOC387715 (Rivera et al., 2005), LTA4H (Helgadottir et al., 2006), C2 (Gold et al., 2006), CFB (Gold et al., 2006), NPSR1 (Laitinen et al., 2004), MYO9B (Monsuur et al., 2005), IL2RA (Vella et al., 2005), SEMA5A (Maraganore et al., 2005) and LOC439999 (Grupe et al., 2006).
Each disorder required a custom corpus, either an OMIM record or one or more review articles describing the biology of the disorder (Table 2). Review articles were selected by searching PubMed (Wheeler et al., 2006) for articles published before the year of discovery of each gene association. Where multiple suitable review articles were available, the texts were concatenated to produce the corpus. We removed any direct reference to the test gene in the input text. In addition, entries in the GAD containing the test genes were removed. Thus, the input data closely mimicked the state of knowledge prior
Table 2. Tests using susceptibility genes for complex human traits

                                                                           Reviews                                 OMIM
Complex trait                     OMIM    Review(s)[a]          Gene[b]    Rank      Total     Percent   Enrich    Rank      Total     Percent   Enrich
Age-related macular degeneration  603075  15094132; 15350892    CFH        7263      13771     47.3      2         10450     12608     17.1      1
                                                                LOC387715  –         13771     –         –         –         12608     –         –
ARMD (second run)                 603075  N/A[c]                C2         –         –         –         –         766       12875     94.1      17
                                                                CFB        –         –         –         –         44        12875     99.7      293
Alzheimer’s disease               104300  15225164              LOC439999  –         13550     –         –         –         13709     –         –
Asthma                            600807  12810182; 14551038    NPSR1      1117      13881     92.0      12        2835      13120     78.4      5
Autism                            209850  11733747; 12142938    EN2        98        13610     99.3      139       98        13213     99.2      135
Celiac disease                    212750  12907013; 12699968;   MYO9B      234       13039     98.2      56        168       12703     98.7      76
                                          14592529
Myocardial infarction             608446  15861005; 16041318    LTA4H      122       14043     99.1      115       –[d]      –         –         –
Parkinson’s disease               168600  16026116; 16278972    SEMA5A     4548      13477     66.2      3         879       13329     93.4      15
Rheumatoid arthritis              180300  15478157; 12915205    PTPN22     333       13279     97.5      40        2156      13038     83.5      6
                                                                FCRL3      3743      13279     71.8      3         2230      13038     82.9      6
Schizophrenia                     181500  15340352; 16033310    ENTH       10013     14603     31.4      1         8065      13572     40.6      2
Type 1 diabetes mellitus          222100  12270944; 11921414;   SUMO4      12123     14272     15.1      1         7675      13130     41.5      2
                                          11237226; 11899083    PTPN22     165       14272     98.8      86        833       13130     93.7      16
                                                                IL2RA      130       14272     99.1      110       528       13130     96.0      25
                                                                CTLA4      78        14272     99.5      183       324       13130     97.5      40
Type 2 diabetes mellitus          125853  15662000; 15662001;   TCF7L2     2911      13922     79.1      5         4013      13586     70.5      3
                                          15662002; 15662003
Totals                                                                     725[e]    13826[e]  94.7[e]   54[f]     879[e]    13130[e]  93.4[e]   43[f]

[a] PubMed ID(s) of review articles used in corpus.
[b] For references see Methods section. HUGO approved gene symbols used to identify genes.
[c] No suitable review corpus available (see Methods section).
[d] The OMIM record is insufficiently detailed and was not used.
[e] Median result.
[f] Mean result.
A computational system to select candidate genes for complex human traits
Substitution Patterns in RRE
We examined 3 possible explanations for the surprising result that κ_p < κ_u in RRE. First, because both the RRE and CRE secondary structures occur within coding regions, we examined the possibility that the difference between κ_p and κ_u is diminished by selection on the protein sequence. We recalculated κ_p and κ_u for both molecules using only data from 4-fold degenerate sites in paired and unpaired regions. In CRE, the presence of codons affects the estimates in the predicted direction (4-fold degenerate sites: κ_p = κ_u^2.89; all sites: κ_p = κ_u^1.45), though the 4-fold sites overshoot the predicted pattern. We had less power to compare 4-fold degenerate sites at the paired and unpaired sites of RRE because there were too few 4-fold degenerate unpaired sites, and there was insufficient sequence variability at these sites. However, the 4-fold degenerate paired sites did show a higher κ_p (κ_p = 7.61 with 95% CI [4.79–18.48]) than the paired sites as a whole (κ_p = 4.21 with 95% CI [3.51–5.28]). This suggests that the presence of protein-coding constraints does impede compensatory evolution at paired sites in RNA secondary structures, although it does not explain why κ_u would be "greater" than κ_p in RRE.
Second, we examined the possibility that we had used a nonrepresentative sample of RRE sequences. To confirm that the observed substitution patterns in RRE were not specific to the particular set of HIV sequences we examined (which were all derived from subtype B), we estimated κ_p and κ_u from 2 additional RRE alignments of sequences drawn from higher taxonomic levels: sequences from different subtypes (1 sequence each from A, B, C, F, G, H, J, and K) and sequences from different groups (1–2 sequences each from M, N, and O) of HIV. In both these alignments, the results were qualitatively similar to those for subtype B: κ_u was significantly higher than κ_p (table 4).
Third, we considered whether the RRE estimates were disproportionately influenced by a portion of the molecule that experiences a type of selection that differs from the molecule as a whole. We systematically removed each stem-loop of RRE and reestimated κ_p and κ_u for the resulting partial structures. The κ_p and κ_u estimates were qualitatively similar for all these partial structures (table 5).
FIG. 2. Best-fit nucleotide substitution models for each alignment. Shown is a cartoon illustration of the rate categories of the best-fit nucleotide substitution models for each molecule. Within a molecule, rates were scaled to the maximum rate (black). Diagonal lines depict transitions; the edges of the square depict transversions. The HKY85 model, which was used for the rate ratios reported throughout this article, is shown for comparison on the right.
FIG. 3. Transition–transversion rate ratios (κ) for each alignment. The dotted line represents a 1:1 relationship between κ_p and κ_u. The solid line represents the predicted relationship κ_p = κ_u^2. Note that the CRE data point is from the analysis of 4-fold degenerate sites in paired and unpaired regions.
Table 3. Transition–Transversion Rate Ratios (κ)

Structure    κ        κ_p      κ_u     λ
RRE          5.19     4.21     9.01    546.05[a]
IRES         6.50     15.34    3.60    73.46[a]
CRE          12.52    22.36    2.93    177.32[a]
5S rRNA      3.70     4.44     2.82    35.05[a]
16S rRNA     3.24     3.79     2.02    665.64[a]
23S rRNA     2.57     3.06     1.71    1281.71[a]
A tRNA       6.04     9.48     3.30    204.73[a]
M tRNA       11.98    18.78    9.65    122.24[a]
12S rRNA     3.90     6.69     2.83    131.93[a]
RNase P      2.98     4.86     1.30    59.21[a]

[a] LRT value significant at P < 0.0001
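As a worked check on the excerpt above, the exponent x in κ_p = κ_u^x can be recovered from any pair of estimates as x = ln(κ_p) / ln(κ_u), and compared with the predicted value of 2. The short Python sketch below does this for a few rows of Table 3; the function name `implied_exponent` and the choice of rows are illustrative assumptions, not part of the original analysis.

```python
import math


def implied_exponent(kappa_p, kappa_u):
    """Return x such that kappa_p == kappa_u ** x.

    Requires kappa_p > 0 and kappa_u > 0 with kappa_u != 1.
    """
    return math.log(kappa_p) / math.log(kappa_u)


# (kappa_p, kappa_u) pairs copied from Table 3 of the excerpt.
table3 = {
    "RRE": (4.21, 9.01),
    "IRES": (15.34, 3.60),
    "CRE": (22.36, 2.93),
    "5S rRNA": (4.44, 2.82),
}

for structure, (kappa_p, kappa_u) in table3.items():
    x = implied_exponent(kappa_p, kappa_u)
    print(f"{structure}: kappa_p = kappa_u^{x:.2f}")
```

For CRE this recovers the exponent of about 2.89 quoted in the text for 4-fold degenerate sites, while RRE gives an exponent below 1, reflecting κ_p < κ_u.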
1784 Knies et al.
Published tables & figures
Analysed data
Raw data
Relatively little data is published within articles
Reuse of open data boosts citations to the original article
Piwowar and Vision (2013) doi:10.7717/peerj.175
Most analyzed data is in the ‘long tail’, for which there is no specialized repository
[Figure: volume (y-axis) against rank frequency of datatype (x-axis). Labelled regions: structured data (e.g. GenBank, GBIF) and long-tail data.]
After Heidorn (2008) doi:10.1353/lib.0.0036
Peer-to-peer data sharing does not work
Wicherts and colleagues requested data from 141 articles in American Psychological Association journals.
“6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes…” only 27% of authors complied
Wicherts JM, Borsboom D, Kats J, Molenaar D (2006) doi:10.1037/0003-066X.61.7.726
[Figure (Michener et al. 1997): information content (y-axis) against time (x-axis), starting at the time of publication. Labels: specific details, general details, accident, retirement or career change, death.]
Data is best captured at the time of publication
Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.
CC-BY Adamo http://www.piqs.de/fotos/121272.html
Joint Data Archiving Policy (JDAP)
Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future.
As a condition for publication, data supporting the results in the article should be deposited in an appropriate public archive.
Authors may elect to embargo access to the data for a period up to a year after publication.
Exceptions may be granted at the discretion of the editor, especially for sensitive information.
http://datadryad.org/pages/jdap
Piwowar HA, Chapman WW (2008) hdl:10101/npre.2008.1700.1
High impact factor journals have stronger data archiving policies
[Figure labels: n=70; IF=3.6; IF=4.5; IF=6.0]
[Figure: workflow between DRYAD and JOURNAL. Roles: author, editor, data curator. Steps shown: prepare manuscript and related data files; submit manuscript; manuscript review; accepted? (yes/no); send article description; upload data; curation; accepted?; send data identifier (DOI); Dryad data package. Outputs: published article (with data citation); published data (with article citation).]
When using this data, please cite the original article: Chave J, Coomes D, Jansen S, Lewis SL, Swenson NG, Zanne AE (2009) Towards a worldwide wood economics spectrum. Ecology Letters 12: 351-366. doi:10.1111/j.1461-0248.2009.01285.x
Additionally, please cite the Dryad data package: Zanne AE, Lopez-Gonzalez G, Coomes DA, Ilic J, Jansen S, Lewis SL, Miller RB, Swenson NG, Wiemann MC, Chave J (2009) Data from: Towards a worldwide wood economics spectrum. Dryad Digital Repository. doi:10.5061/dryad.234
No fees for submission from low and lower middle income countries
Dryad by the numbers
• Data packages: 4,172
• Authors: 15,581
• Data files: 11,912
• Integrated journals: 37
• All journals: 268
• File downloads: 4,629,256
Stats as of 23 Oct 2013
To learn more
• Repository home: http://datadryad.org
• News: http://blog.datadryad.org
• Project documentation: http://wiki.datadryad.org
• Twitter: @datadryad
• Code: http://code.google.com/p/dryad
or contact us:
• http://datadryad.org/feedback
• Todd Vision, Director, [email protected]
• Laura Wendell, Dryad Executive Director, [email protected]