
Scott Edmunds ICIS talk at UC Davis: Open Publishing for the Big Data era


Scott Edmunds Innovating Communication in Scholarship talk at UC Davis: Open Publishing for the Big Data era, 30th October 2014


0000-0001-6444-1436

@SCEdmunds

[email protected]


Challenges/Opportunities in the Data-Driven Era

Quick response to climate change, food security & disease outbreaks

Using networking power of the internet to tackle problems

Can ask new questions & find hidden patterns & connections

Build on each other's efforts more quickly & efficiently

More collaborations across more disciplines

Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding

Enables:

Enabled by: Removing silos, standards/formats, open-access/data

Big Challenges:


What do publishers do?


Apologies: http://scholarlykitchen.sspnet.org/2014/10/21/updated-80-things-publishers-do-2014-edition/

the scholarly chicken

(tl;dr version)


The problems with publishing

• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible. (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)

• Traditional publishing policies and practices are a hindrance (licensing & access, embargoes, Ingelfinger, closed doors, anti-granularity & forking)

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication.


Are publishers really adding value?

1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124


The consequences: growing replication gap

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced


Consequences: increasing number of retractions

>15X increase in last decade

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?


At the current rate of increase, by 2045 as many papers will be retracted as published

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950


STAP paper demonstrates problems:

Nature Editorial, 2nd July 2014:

“We have concluded that we and the referees could not have detected the problems that fatally undermined the papers. The referees’ rigorous reports quite rightly took on trust what was presented in the papers.”

http://www.nature.com/news/stap-retracted-1.15488


Need:

…to publish protocols BEFORE analysis

…better access to supporting data

…more transparent & accountable review

…to publish replication studies


The solutions for publishing?

1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124
2. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.1001747



New incentives/credit

• Data
• Software
• Review
• Re-use…

= Credit

Credit where credit is overdue:

“One option would be to provide researchers who release data to public repositories with a means of accreditation.”

“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”

Nature Biotechnology 27, 579 (2009)


Rewarding open data


More transparency: open peer review


Reward open & transparent review

Data from open- and closed-review journals of similar scope in the BMC Series show it is ~5-10% harder to get referees for open review (data from Tim Sands at BMC).

• Good data showing no difference in acceptance/rejection rates, but better quality reviews.

• Does take marginally longer to find reviewers (and for them to return reports).

BMC Series Medical Journals

GigaScience + Publons = further credit for reviewers' efforts

http://publons.com/


GigaScience + AcademicKarma = even more credit

Reward faster review

http://academickarma.org/


Real-time open-review = paper in arXiv + blogged reviews


http://tmblr.co/ZzXdssfOMJfy
www.gigasciencejournal.com/content/2/1/10


(Assemblathon ‘publish for free’ contest: [email protected])


We are publishing snapshots


Snapshots of the research cycle

Genomic: (cats, and minipigs, and parrots, and elephants, oh my!)

Imaging: fMRI, myocardial MRI, micro-CT from worms & centipedes, sea urchin MRIs

Neurophysiology: neural activity recordings, EEG

Data, data, data…


Strict code availability policy in GigaScience (OSI compliant)

Publication/proof of record version archived in GigaDB

Provides extra credit & discoverability with DOI

Also link to the dynamic/updating version in a code repository, including our GigaGitHub repo (https://github.com/gigascience)

Experimenting with supplemental tables in GitHub (see: https://github.com/gigascience/paper-chen2014/wiki)

Snapshots of the research cycle

Software, pipelines, workflows…


Implement workflows in a community-accepted format

http://galaxyproject.org

Over 50,000 main Galaxy server users

Over 1,000 papers citing Galaxy use

Over 60 Galaxy servers deployed

Open source


Snapshots of the research cycle


galaxy.cbiit.cuhk.edu.hk

Workflow publishing:


Visualisations & DOIs for workflows

http://www.gigasciencejournal.com/series/Galaxy


Next step: publishing VMs…

http://dx.doi.org/10.5524/100106


Further beyond dead trees: Open lab books, dynamic documents

• Can facilitate reproducibility, reuse & sharing with tools like knitr, Sweave and IPython Notebook

• Working towards executable papers…


Aiding reproducibility of imaging studies

OMERO: providing access to imaging data

Already used by JCB.

View, filter, measure raw images with direct links from journal article.

See all image data, not just cherry picked examples.

Download and reprocess.


The alternative...

...look but don't touch


Beneficiaries/users of our work

IRRI GALAXY

Rice 3K project: 3,000 rice genomes, 13.4 TB public data


Our first DOI:

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
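
The dataset citation on this slide gives the DOI in two forms, a bare `doi:` identifier and a resolvable link. A minimal sketch of how one maps to the other is below; the function name `doi_to_url` is hypothetical, and the resolver prefix is simply the `http://dx.doi.org/` form used on the slide.

```python
# Resolver prefix as written on the slide (an assumption that this form is wanted).
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.5524/100001' or '10.5524/100001' into a resolver link.

    Hypothetical helper for illustration only.
    """
    doi = doi.strip()
    # Drop an optional 'doi:' scheme prefix.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # All DOIs start with the directory indicator '10.'.
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + doi


print(doi_to_url("doi:10.5524/100001"))
```

Keeping the DOI itself as the stored identifier and deriving the link on demand means the citation stays valid if the preferred resolver address changes.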


Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Citations (~240)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science


1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.


Nanopore MinION E. coli genome released via GigaDB 10-Sep-2014

Curated & converted to ISA-tab, & worked with EBI to get raw data there

Data Note submitted & preprint version out 26th September

Peer reviewed & published 20th October


http://dx.doi.org/10.5524/100102
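
To put the turnaround on this slide into numbers, here is a minimal sketch that counts the days from data release to preprint and to peer-reviewed publication. The dates are the ones listed on the slide; the year 2014 for the preprint and publication dates is an assumption taken from the release date, and `days_since_release` is a hypothetical helper name.

```python
from datetime import date

# Dates from the slide. The year for the last two is assumed to be 2014,
# matching the stated release date.
release = date(2014, 9, 10)     # genome released via GigaDB
preprint = date(2014, 9, 26)    # Data Note submitted, preprint out
published = date(2014, 10, 20)  # peer reviewed and published


def days_since_release(d: date) -> int:
    """Whole days elapsed between the data release and date d."""
    return (d - release).days


print(days_since_release(preprint))   # 16
print(days_since_release(published))  # 40
```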


Real time sequencing era needs real time publication!

• Used as test data for “minoTour”: real time data analysis tools for MinION data

• Nanopore data already used in (CC0 GitHub based) teaching materials

• Next stop…poreathon! (crowdsourced v2 assembly)

1. minoTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly


Lessons Learned

• Most published research findings are false. Or at least have errors.

• It is possible to push button(s) & recreate a result from a paper

• Reproducibility is COSTLY. How much are you willing to spend?

• Much easier to do this before rather than after publication


The cost of staying with the status quo?

• Ioannidis estimates that 85% of research resources are wasted.

• Each retraction estimated to cost $400,000.


In Summary

Make your data & software open (CC0, OSI)

Get credit for your reviewing

Publish your research objects (with us!)*

[email protected]

www.gigasciencejournal.com

@gigascience

facebook.com/GigaScience

* Free APCs until end of 2014


Thanks to:

Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

Peter Li, Chris Hunter, Jesse Si Zhe, Rob Davidson, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)

Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

@gigascience

facebook.com/GigaScience

blogs.biomedcentral.com/gigablog/

www.gigadb.org

galaxy.cbiit.cuhk.edu.hk

www.gigasciencejournal.com

CBIIT

Funding from:

Our collaborators:

team:

Case study: