Sharing re-usable phylogenetic data: we're not there yet


My talk given at TDWG (Florence, Italy), 9am 31st October 2013


Sharing reusable phylogenetic data: we're not there yet

Ross Mounce

@rmounce

http://orcid.org/0000-0002-3520-2046

A talk of two halves

1.) Outlining the extent of the problem

(lack of) sharing, standards, care (?)

2.) What I'm trying to do about it:

Digging data out of PDFs

Re-releasing as

Just ~4% of published phylogenetic studies in 2010 publicly archived their supporting phylo data in

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, & Vos R. 2012 Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis

BMC Research Notes 10.1186/1756-0500-5-574

Where's the data?

Check our data yourself on Dryad here: 10.5061/dryad.h6pf365t

Scientists cannot be relied upon to share published data upon request

This has been known for a while now, e.g. (in psychology) Wicherts et al. 2006

But has been confirmed to be true for phylogenetics too:

Drew et al. 2013, 'Lost Branches in the Tree of Life', report that just ~16% of researchers contacted supplied the requested ('published') phylo data.

My own experience tallies with this – I soon stopped bothering to try and ask people via email for a copy of their published data. It's a waste of time.

The (Single) Supplementary Data File was a Y2K solution – a dump

ResearchData

Many legacy journal supplementary data systems bury data and leave it there to decompose

Often not in a re-usable form, e.g. a lazy PDF

Sometimes 'typeset', corrupting the data

A jumble of words & data where the bit you want is on page 92 (no programmatic access)

BURIED and really not very discoverable

Do reviewers even look at it? I think not tbh

I wasted too much of my PhD trying to get usable data to re-analyze

This is what I felt like... So I tried to do something about it...

www.supportpalaeodatarchiving.co.uk

An open letter in support of palaeontology data archiving

Which was picked up by Nature News

Which, in turn, got me in touch with:

Part 2

Since few will help you to re-use their data

You've got to dig it out and

make it re-usable yourself

AND re-release it openly

so no-one else wastes their time doing this

It's not just phylogenetics.

I learned from the Open Knowledge Conference (Berlin 2011) that a lot of different academic fields also seem to struggle to make re-usable published data available.

If it's a common, shared problem... why not seek a shared, cross-disciplinary solution?

AMI (Amanuensis)

Building upon tools first developed in computational chemistry by the Murray-Rust lab

e.g.

ChemicalTagger → PhyloTagger (Entity tagging)

(Chem)PubCrawler → (Phylo)PubCrawler (to get 10,000+ PDFs to work on)

https://bitbucket.org/nickday/pub-crawler

http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger

Open Source
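To give a flavour of what entity tagging means for taxon names, here is a minimal sketch. It is not how ChemicalTagger or PhyloTagger work internally: it simply assumes a naive regular expression for "Genus epithet" pairs plus a small, hypothetical genus lexicon (KNOWN_GENERA) to filter out false positives such as the first two words of a sentence. The function name tag_binomials is also made up for this illustration.

```python
import re

# Naive candidate pattern (an assumption for illustration only):
# a capitalised word followed by a single space and a lower-case word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s([a-z]{3,})\b")

# Hypothetical mini-lexicon; a real tagger would need a full taxonomic name list.
KNOWN_GENERA = {"Turdus", "Taeniopygia", "Falco"}

def tag_binomials(text):
    """Return (genus, epithet) pairs whose genus is in the lexicon."""
    candidates = ((m.group(1), m.group(2)) for m in BINOMIAL.finditer(text))
    return [(genus, epithet) for genus, epithet in candidates
            if genus in KNOWN_GENERA]

sample = "Sequences from Turdus iliacus and Taeniopygia guttata were compared."
print(tag_binomials(sample))
```

Without the lexicon filter the pattern would also tag "Sequences from", which is why a pattern on its own is not enough.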

BBSRC grant approved

“PLUTo: Phyloinformatic Literature Unlocking Tools”

Software for making published phyloinformatic data discoverable, open, and reusable

...I just need to get my PhD viva done & rubber-stamped

Instructions for getting the current working setup here: (multiple repositories, dependencies & requirements!)

http://rossmounce.co.uk/2013/10/06/setting-up-ami2-on-windows/

Evolution of ultraviolet vision in the largest avian radiation - the passerines Anders Ödeen 1* , Olle Håstad 2,3 and Per Alström 4

PDF

HTML

Styles, superscripts and diåcritics preserved!

AMI

PDF

Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus

Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2

Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL

Acanthisittidae Acanthizidae Acrocephalidae Callaeidae Campephagidae Cnemophilidae Corvidae

0.84 0.91 0.93 0.95

Acanthisitta Acrocephalus Ailuroedus Ailuroedus Amytornis Camptostoma

AMI

23.12 34.54 37.21 38.55

Posterior probability

Branch lengths

NexML

Genus Family

HTML
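Once names, support values and branch lengths are out of the figure, they need to go into a machine-readable tree file. The slides name NexML as the output; the sketch below uses plain Newick instead, purely because it fits in a few lines. The topology and the pairing of numbers with taxa are invented for illustration (only the individual names and numbers appear on the slide), and to_newick is a hypothetical helper, not part of AMI.

```python
def to_newick(node):
    """Serialise a nested-dict tree to Newick (without the trailing ';')."""
    children = node.get("children", [])
    label = node.get("name", "")
    if children:
        inner = ",".join(to_newick(child) for child in children)
        support = node.get("support")
        # Internal node: parenthesised children, then optional support value.
        label = f"({inner})" + (f"{support:.2f}" if support is not None else "")
    length = node.get("length")
    # Append the branch length, if one was extracted for this node.
    return label + (f":{length:.2f}" if length is not None else "")

# Illustrative tree only: this topology is an assumption, not from the paper.
tree = {
    "children": [
        {"name": "Acanthisitta", "length": 23.12},
        {"support": 0.84, "length": 34.54, "children": [
            {"name": "Amytornis", "length": 37.21},
            {"name": "Camptostoma", "length": 38.55},
        ]},
    ],
}

newick = to_newick(tree) + ";"
print(newick)
```

A string like this can be read by most phylogenetics software, which is the point: the data stops being a picture and becomes something re-usable.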

Acknowledgements & Thanks

For travel & accommodation support, without which I couldn't possibly attend TDWG

For the Panton Fellowship, inspiration and support

To the organisers of both the session (Nico, Hilmar, Rutger) and the conference as a whole!

My main collaborators on PLUTo: Matthew Wills and Peter Murray-Rust
