Genome sharing projects around the world
– and how you find data for your research
Fiona Nielsen
Lunteren, April 18 2016
Slides will be made available online
Follow us on Twitter: @repositiveio
Find me on twitter: @glyn_dk
Workshop outline
1. What data are you looking for? And why?
2. Data resources from around the world
3. Tips on how to find and access data
4. Hands-on using Repositive
5. Summary and feedback
1. What data are you looking for?
This workshop will focus on finding and accessing human genomic data.
… And why would you be looking for genomic data for your research?
Are you researching cancer or genetic diseases?
How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015: UK10K, Icelandic population (2,636 + 100k imputed), The Cancer Genome Atlas (~11,000 genomes), ExAC consortium (65,000 exomes)
?
Statistically speaking, you still need tens of thousands of samples for validation
The more severe the phenotype and the more complete the penetrance, the easier it will be for you to find your variant, but:
“As the genetic complexity of the disease increases (for example, reduced penetrance and increased locus heterogeneity), issues of statistical power quickly become paramount.” http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html
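A rough way to see why sample sizes climb so quickly: the sketch below uses a standard normal-approximation formula for comparing two proportions (here, a variant's frequency in cases versus controls). It is an illustration only, not the method of the cited review; the function name `samples_per_group`, the example frequencies and the 5e-8 significance threshold are assumptions chosen for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p_cases, p_controls, alpha=5e-8, power=0.8):
    """Approximate observations needed per group for a two-sided
    two-proportion z-test (normal approximation, equal group sizes)."""
    if p_cases == p_controls:
        raise ValueError("the two frequencies must differ")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    p_bar = (p_cases + p_controls) / 2           # pooled frequency under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_cases * (1 - p_cases) + p_controls * (1 - p_controls))
    ) ** 2
    return ceil(numerator / (p_cases - p_controls) ** 2)


# Hypothetical example: a variant at 1.2% frequency in cases vs 1.0% in controls.
n = samples_per_group(p_cases=0.012, p_controls=0.010)
print(f"about {n:,} observations per group")
```

Shrinking the frequency difference or tightening `alpha` inflates the required sample size, which is one reason to involve a statistician early.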
But I am just looking at this one disease…
What can I do?
PRO TIP: involve a statistician early on in your study design!
How can I determine significance?
“One potentially powerful approach is to assess conservation across and within multiple species as whole-genome sequence data become more abundant.”
Look at extreme phenotypes: “Sampling cases or controls from the extremes of an appropriate quantitative distribution can often increase power”
Look at non-SNP variants, they are more likely to have functional effects
- “how to account for the technical features of sequencing, such as incomplete sequencing and biased coverage over the genome?”
Think of how you can provide evidence that your result is not just a local technical variation or sampling bias
e.g. data from the same cell type, same sequencing technology, same alignment…
How to account for bias?
PRO TIP: include more reference data in your analysis
• Know what data is available in your lab, your dept, your org
• Survey from Qiagen showed that one of the main reasons researchers collaborate is to get access to data!
How can I access more data for my research?
How can I find collaborators?
PRO TIP: Search for collaborators who have the data you need
PRO TIP: Tell your colleagues and peers what type of data you have in your lab
2. Data resources from around the world
Public repositories
• some you apply for access, especially if data contains clinical info or whole genome PID
• some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …
• some are consented for general research use, some have specific consent
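For the open-access NCBI repositories (SRA, GEO), a search can be scripted through NCBI's E-utilities `esearch` endpoint. A minimal sketch that only builds the query URL; the helper name `build_esearch_url` and the example search term are illustrative assumptions, and the term syntax should be checked against the NCBI documentation for each database.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Illustrative query: human records in SRA mentioning breast cancer.
url = build_esearch_url("sra", "Homo sapiens[Organism] AND breast cancer")
print(url)
# To run the search, open the URL (e.g. with urllib.request.urlopen)
# and parse the JSON response for the list of matching record IDs.
```

Restricted repositories such as EGA and dbGaP still require an access application before any download.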
Large amounts of data, but not accessible
[Figure: WGS data available in public repos, 2006 to 2015, exponential growth rate]
≈ 0.5 PB sequence available
80+ PB sequenced every year
Under-utilised data has huge potential for medical research
DATA is fragmented
It may be confusing
Hundreds of data sources…but they aren’t easy to find!
[Figure: bar chart, Jan-15 to Mar-16 (Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16); data labels: 10, 25, 33, 35, 102, 163]
First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418
Data source content
Assay Types
Dedicated to…
Number of samples in data sources
[Figure: samples per data source; y-axis: sample # (log10)]
Top 5:
GEO (1.8M)
PMI Cohort Program (1M)
Auria Biopankki (1M)
EGA (~0.6M)
SRA (~0.5M)
Data accessibility
Can download the data straight away or after logging in.
Need to apply for access to the data.
Has both Open and Restricted access data within one repository.
Online data source ‘types’
University – Affiliated to a university. Often only members of that university can upload/download to/from it.
Catalogue – Doesn’t have raw data but lists studies/datasets.
Initiative/Consortium – Has a specific purpose/aim. Often focussed on a question or disease.
Repository – Can download from, has data from multiple institutions. Often can also upload your own data there.
Company – For-profit organisation. Listing data is not their main purpose.
Biobank – Many have sequence data of their biological samples.
Sequenced ethnicities
Aboriginals
African Americans
Africans
Australians
Chinese
Malays
Indians
Danish
Dutch
Estonian
Russian
European Ancestry
Finnish
Icelandic
Japanese
Korean
Latin Americans
Saudi
Swedish
Machines & Data sources
[Figure: map of machines and data sources by region; one label reads “23 International”]
Interesting site to look at: http://omicsmaps.com/stats
Main Repository funders
BGI = 4
EBI = 9
NIH = 10
NCBI = 9
The Broad = 8
Wellcome = 4
EBI total 104 services, 19 repositories http://www.ebi.ac.uk/services/all
NCBI total 67 databases http://www.ncbi.nlm.nih.gov/guide/all/#databases_
3. Tips to find and access data
• Case study: DNA data on Cancer
Case Study – DNA data on Cancer
Repositories you have heard of:
Repository | Data Type | Access
ArrayExpress | Expression | Open
GEO | Expression | Open
EGA | Mixed | Restricted
dbGaP | Mixed | Restricted
Encode | Healthy Reference | Open
1000 Genomes | Healthy Reference | Open
Ask around (word of mouth):
Repository | Data Type | Access
COSMIC | Somatic mutations & WGS | Open
ClinVar | Variant information | Open
ExAC | Allele Freq. but not raw data | Open
SRA | Individual sequences | Open
TCGA | Clinical & high level data | Open
CGHub | Low level data (DNA data) | Restricted
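One way to act on the table above is to keep it as a small lookup and start with what you can download today. A sketch with the rows copied from the two tables; the `Repository` type and the `by_access` helper are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Repository(NamedTuple):
    name: str
    data_type: str
    access: str


# Rows taken from the case-study tables above.
REPOSITORIES = [
    Repository("ArrayExpress", "Expression", "Open"),
    Repository("GEO", "Expression", "Open"),
    Repository("EGA", "Mixed", "Restricted"),
    Repository("dbGaP", "Mixed", "Restricted"),
    Repository("Encode", "Healthy Reference", "Open"),
    Repository("1000 Genomes", "Healthy Reference", "Open"),
    Repository("COSMIC", "Somatic mutations & WGS", "Open"),
    Repository("ClinVar", "Variant information", "Open"),
    Repository("ExAC", "Allele Freq. but not raw data", "Open"),
    Repository("SRA", "Individual sequences", "Open"),
    Repository("TCGA", "Clinical & high level data", "Open"),
    Repository("CGHub", "Low level data (DNA data)", "Restricted"),
]


def by_access(repos, access):
    """Return repository names whose access level matches (case-insensitive)."""
    return [r.name for r in repos if r.access.lower() == access.lower()]


open_repos = by_access(REPOSITORIES, "Open")        # usable straight away
restricted = by_access(REPOSITORIES, "Restricted")  # start these applications early
print("Open:", ", ".join(open_repos))
print("Restricted:", ", ".join(restricted))
```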
Case Study – DNA data on Cancer
We have identified the first 27 cancer-specific data sources:
Abcodia, AmbryShare, BRCA Exchange, Breast Cancer Now Tissue Bank, Broad Cancer programme datasets, Cancer Moonshot 2020, CanGEM, CGCI, CGHub, Chinese cancer genome consortium, Chinese national human genome centre, Follicular Lymphoma Genome Data, G-DOC, GenoMel, ICGC, National Mesothelioma Virtual Bank, NCIP Hub, Project GENIE, Target, TCGA, Texas cancer research biobank, NCI-60, CCLE, COSMIC, Fantom, cancer methylome system, Cancer therapeutics response portal
And many more that contain cancer data alongside other data types.
1. Register for eRA account
2. Request access to specific dataset of interest
3. Download data
Registering for CGHub: https://cghub.ucsc.edu/keyfile/newuser.html
‘Principal signing official’ registers
Email to verify
Email to confirm/deny access to website
Email with temporary password
Change password
Electronic signature
Login
Fill in contact info
Complete ‘424’ form (research application form)
Request reviewed by DAC
Email to confirm/deny access to data
Login
Retrieve personal access token
Download!
Often a long process
Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data
We studied the problem by qualitative interviews followed by a survey of researchers in human genetics
Often a long process
T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013
Researchers spend months to find and access genomic data, and often choose to not access data at all
Why the barrier?
• Benefits: strict governance, review of consent, applicant signs for full responsibility for governance
• Disadvantages: No control of data once access is given, high barrier for access – too high?
• Start planning your data needs early in your project
• When you find the data you need, start application
• Use Open Access data
How can I save time?
PRO TIP: If you use human genomic data, apply for the GRU (general research use) datasets in dbGaP: one application gives access to all the GRU datasets
• Some data is Open Access, which requires specific consent
• OpenSNP.org (Bastian)
• Personal Genome Projects
• Individuals who put their genomes online, e.g. Manuel Corpas and his family, “the Corpasome”
• http://manuelcorpas.com/about/
Not all data is restricted
Personal Genome Project | PGP Harvard | PGP Canada | PGP UK | Genom Austria
Host institution | Harvard Medical School Boston | SickKids Toronto | University College London | CeMM Research Center for Molecular Medicine
Principal Investigator | George Church | Stephen Scherer | Stephan Beck | Christoph Bock & Giulio Superti-Furga
Launch year | 2005 | 2012 | 2013 | 2014
Geographic scope | USA, mainly Boston | Canada | United Kingdom | Mainly Austria
Enrollment eligibility (all projects) | At least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded
Data generated | Whole genome sequencing, upload of additional data possible | Mainly whole genome sequencing | Whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing | Mainly whole genome sequencing
Number of genomes | 100s | 10s | 10s | 10s
Data access | http://personalgenomes.org/harvard/data | | | http://genomaustria.at/unser-genom/#genome-der-pionierinnen
Project funding | Discretional funds and corporate sponsoring | Institutional startup funds | Discretional funds and corporate sponsoring | Institutional startup funds
Areas of emphasis | Integration with phenotypic data, collaboration with other personal omics initiatives | Genome donations, synergy with massive-scale clinical genome sequencing projects | Genomes and society, genetic literacy, school projects, education
Website | http://personalgenomes.org/harvard/ | http://personalgenomes.org/canada/ | http://personalgenomes.org/uk/ | http://genomaustria.at/
Summary of data access barriers
Data is uploaded to repository
Data is discovered by potential user
Data is accessed by potential user
• “even when researchers are authorised to share data they report reluctance to do so because of the amount of effort required” http://www.sciencedirect.com/science/article/pii/S2212066114000386
• “Clinical geneticists cited a lack of time because their main priority is diagnosing patients. Industrial researchers cited a lack of time because of the pressure to meet the deadlines in their job. Researchers in academia cited both a concern about the potential loss of future publications once unpublished data is shared, and the lack of time and incentive to share data as this does not contribute to their publication record. Researchers from all categories felt that they lacked sufficient resources to make their data available.”
The barrier of making data available
But I do not want to share my data
• If you expect data to be available to you – you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data
Best practices
• Use all available tools to make your life easier:
• Data publications: visibility and citations for your data, e.g. GigaScience and Scientific Data
• Figshare, Zenodo, Dryad for sharing open access data
• PhenomeCentral, Matchmaker Exchange for rare disease research
• Repositive for finding data across repositories and making your own data discoverable
Best practices: use the tools
Does data sharing matter at grant proposal evaluation?
Based on: Winning Horizon 2020 with Open Science, http://dx.doi.org/10.5281/zenodo.12247
Best practices: Plan into your grant proposals
“Weakness: Involvement of non-academic beneficiaries is limited”
“Weakness: highly focused on academic activities, and lacks an advanced communication strategy”
“Weakness: limited exposure to non-academic partners & infrastructures”
Excellence
Impact
Implementation
“data accessibility is unclear!”
“data storage & access not considered”
Best practices: Plan into your grant proposals
“Strengths: extensive dissemination of data to the scientific community (open access, databases)”
“outreach activities to a broad audience”
“research software is freely available”
Impact:
Make the (research) world a better place by sharing in return
Best practices: Share in return!
• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g. PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access
What the future holds
4. Hands-on session using Repositive
What if finding data was as easy as finding a book on Amazon or booking a hotel on Expedia?
Repositive promotes best practices
Discover new data sources
EASY SEARCH
Make your data visible
SHARE KNOWLEDGE
Build a data community
BUILD TRUST
Benefit for both sides of data collaboration
Data consumers
• Find relevant data faster
• Feedback from other users through ratings and comments to evaluate data quality
• Find collaborators with data
Data producers
• Make your data visible
• Build credibility as a trusted provider of quality data
• Find collaborators to analyse your data
5. Summary and feedback
• Get credit – publish data
• Give credit – cite data
• Understand consent
Tell us your thoughts: @repositiveio
@glyn_dk
And read more on http://repositive.io
Thank you!