SciDataCon: How to increase accessibility and reuse for clinical and personal genomic data
Fiona Nielsen – September 12th 2016
We are always looking for data
Genetics, Cancer, Rare disease research
We need access to the right data at the right time
DNA interpretation
requires lots of data
How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015: UK10K, Icelandic population (2,636 + 100k imputed), The Cancer Genome Atlas (~11,000 genomes)
2016: ExAC consortium, 65,000 exomes
2020: ?
Data is not easy to find and access
FRAGMENTED: Poor visibility of available genomic data
ADMIN BURDEN: Huge overhead to manage data access
BAD CULTURE: Lack of data sharing habits in research culture
Finding and accessing data can take months
Time spent data scouting per project:
< 1 week: 40%
1-3 months: 48%
+6 months: 11%
Why the barrier?
Barriers
• Difficult to find data, let alone find the RIGHT data
• Time-consuming and difficult to apply for access to data
• Complicated and laborious to submit data to public repositories
http://blog.repositive.io/tag/data-access/
http://blog.repositive.io/tag/data-sharing/
Data access applications for sensitive data
• Benefits: strict governance, review of consent, applicant signs for full responsibility for governance
• Disadvantages: No control of data once access is given, high barrier for access – too high?
Alternative process – castle and moat
• Vetted users are allowed into the system where they can investigate and analyse data.
• No raw data exports are allowed and results for export are manually reviewed
• Example: Genomics England
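The castle-and-moat rules above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Enclave` class, its method names and the `Decision` values are hypothetical and do not describe how Genomics England or any other real system is implemented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_REVIEW = "needs_review"


@dataclass
class Enclave:
    """Hypothetical 'castle and moat' environment, not a real system's API."""
    vetted_users: set = field(default_factory=set)
    review_queue: list = field(default_factory=list)

    def analyse(self, user: str) -> Decision:
        # Only vetted users may investigate and analyse data inside the system.
        return Decision.ALLOW if user in self.vetted_users else Decision.DENY

    def export(self, user: str, item: str, is_raw_data: bool) -> Decision:
        if user not in self.vetted_users:
            return Decision.DENY
        if is_raw_data:
            # Raw data never leaves the system.
            return Decision.DENY
        # Derived results are queued for manual review before release.
        self.review_queue.append((user, item))
        return Decision.NEEDS_REVIEW
```

For example, `Enclave(vetted_users={"alice"}).export("alice", "summary.csv", is_raw_data=False)` returns `Decision.NEEDS_REVIEW`, while the same call with `is_raw_data=True` returns `Decision.DENY`.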
Alternative process – controlled disclosure
• Allow vetted users access to privacy-preserving or manually curated exports from the data
• Example: Browsing UK census data – available for all
Read about our pre-competitive PDX data resource in collaboration with AstraZeneca: http://repositive.io/pdx
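One common form of privacy-preserving export is an aggregate table in which small counts are suppressed, so that rare combinations cannot single out an individual. The sketch below is an assumption about what such an export could look like, not the method used by any resource named in this talk; the function name `safe_counts` and the threshold of 5 are hypothetical.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # assumed disclosure threshold, chosen for illustration only


def safe_counts(records, key, min_cell_size=MIN_CELL_SIZE):
    """Count records per value of `key`, hiding any count below the threshold."""
    counts = Counter(record[key] for record in records)
    # Counts below the threshold are replaced by None instead of being released.
    return {
        value: (n if n >= min_cell_size else None)
        for value, n in counts.items()
    }
```

Only the aggregated, suppressed table would be exported; the individual-level records stay with the data holder.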
But where in the world is the data?
Building upon best practices
MAKE DATA DISCOVERABLE
SIMPLIFY WORKFLOWS
CONTRIBUTE TO COMMUNITY
DNAdigest and Repositive – Connecting the world of genomic data: http://www.tinyurl.com/plos-biology-repositive
How to make data easy to discover?
Although there are hundreds of data sources… they aren’t easy to find!
[Chart: number of data sources over time]
Jan-15: 10
Mar-15: 25
Jun-15: 33
Sep-15: 35
Dec-15: 102
Mar-16: 163
First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418
Sequenced ethnicities
Aboriginals
African Americans
Africans
Australians
Chinese
Malays
Indians
Danish
Dutch
Estonian
Russian
European Ancestry
Finnish
Icelandic
Japanese
Korean
Latin Americans
Saudi
Swedish
Machines & Data sources
[Map: counts of sequencing machines and data sources by world region; the only label that survives is "23 International"]
Interesting site to look at: http://omicsmaps.com/stats
Main Repository funders
BGI = 4
EBI = 9
NIH = 10
NCBI = 9
The Broad = 8
Wellcome = 4
EBI total 104 services, 19 repositories http://www.ebi.ac.uk/services/all
NCBI total 67 databases http://www.ncbi.nlm.nih.gov/guide/all/#databases_
We have identified hundreds of data sources
Universities – or repositories affiliated with a university.
Projects/Consortia – have a specific purpose or aim, often focussed on a specific research question or disease.
Public repositories – allow download and upload of data from multiple institutions.
Companies – for-profit organisations making data available for free or as a service.
Biobanks – many hold sequence data for their biological samples.
Researchers know on average 4-5 data sources
More data sources appear every day; to date we have identified 270+
Simpler workflow for data access
And indexed them on the Repositive platform
Discover and access
Efficient Search, see related results
Find colleagues & their data interests
Co-annotate data & community feedback
Free to use: http://discover.repositive.io
Benefit for both sides of data collaborations
Data consumers
• Find relevant data faster
• Feedback from other users through ratings and comments to evaluate data quality
• Find collaborators with data
Data producers
• Make your data visible
• Build credibility as a trusted provider of quality data
• Find collaborators to analyse your data
• Supporting the whole research workflow
• Faster, more efficient data discovery
• Streamlining data access applications
• Developing technology for efficient data access
• Setting up pre-competitive data sharing agreements
• Running workshops and training programmes
More efficient data access
Read about our pre-competitive PDX data resource in collaboration with AstraZeneca http://repositive.io/pdx
Recap: Still a lot of work to do
Barriers
• Difficult to find data, let alone find the RIGHT data
• Time-consuming and difficult to apply for access to data
• Complicated and laborious to submit data to public repositories
http://blog.repositive.io/tag/data-access/
http://blog.repositive.io/tag/data-sharing/
Connecting the world of genomic data
Visit us at: http://repositive.io Or tweet us @repositiveio