DESCRIPTION
Presentation to the Department of Biology at the University of Windsor, Windsor, Ontario. The description and update of activities related to the International Cancer Genome Consortium (ICGC)
A project status for the International Cancer Genome
Consortium (ICGC).
November 21st, 2014
B.F. Francis Ouellette [email protected]
• Senior Scientist & Associate Director,
Informatics and Biocomputing, Ontario Institute for
Cancer Research, Toronto, ON
• Associate Professor, Department of Cell and Systems Biology,
University of Toronto, Toronto, ON.
@bffo on
2
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
3
But first, a little about me …
… an unfinished story!
4
http://goo.gl/v8G57
5
http://goo.gl/WKLNr
http://goo.gl/dJIur
http://goo.gl/LwVOZ
http://goo.gl/kTGkG
http://goo.gl/sO5Na
http://goo.gl/LwVOZ
http://goo.gl/QI6aL
http://goo.gl/mYHFO
http://goo.gl/Jc5TK
15
(from the National Center for Biotechnology Information)
16
from the National Center for Biotechnology Information
17
from the National Center for Biotechnology Information
PANIC
18
19
PANIC
20
PANIC
1998
2001
2004
22
1999
30
http://goo.gl/ZKzMV
31
http://goo.gl/ZKzMV
32
33
International Cancer Genome Consortium: icgc.org
34
http://www.csb.utoronto.ca/
35
http://bioinformatics.ca/
36
37
http://bioinformatics.ca/workshops/2014
39
Cancer: A Disease of the Genome
Challenge in Treating Cancer:
Every tumor is different
Every cancer patient is different
40
Large-Scale Studies of Cancer Genomes
Johns Hopkins
> 18,000 genes analyzed for mutations
11 breast and 11 colon tumors
L.D. Wood et al., Science, Oct. 2007
Wellcome Trust Sanger Institute
518 genes analyzed for mutations
210 tumors of various types
C. Greenman et al., Nature, Mar. 2007
TCGA (NIH)
Multiple technologies
Brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma)
F.S. Collins & A.D. Barker, Sci. Am., Mar. 2007
41
Lessons learned
Heterogeneity within and across tumor types
High rate of abnormalities (driver vs. passenger)
Sample quality matters
Consent and controlled data access are complicated
42
International Cancer Genome Consortium
• Collect ~500 tumour/normal pairs from each of 50 different major
cancer types;
• Comprehensive genome analysis of each T/N pair:
– Genome
– Transcriptome
– Methylome
– Clinical data
• Make the data available to the research community & public.
Identify genome changes:
…GATTATTCCAGGTAT…
…GATTATTGCAGGTAT…
…GATTATTGCAGGTAT…
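The sequence fragments on this slide differ at a single base. As a minimal sketch of what "identify genome changes" means at the smallest scale, the Python below compares two equal-length fragments position by position. Treating the first fragment as the normal sample and the second as the tumour is an assumption made for illustration; the slide does not label them, and real somatic mutation calling works on aligned sequencing reads rather than plain strings.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this simple comparison needs sequences of equal length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Fragments copied from the slide. Which one is "normal" is an assumption.
normal = "GATTATTCCAGGTAT"
tumour = "GATTATTGCAGGTAT"

print(find_substitutions(normal, tumour))  # [(7, 'C', 'G')]
```

The single reported difference, C to G at 0-based position 7, is the kind of change a tumour/normal comparison is meant to surface.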
43
Rationale for the ICGC
• The scope is huge, such that no country can do it all.
• Coordinated cancer genome initiatives will reduce
duplication of effort for common and easy-to-acquire
tumor samples and ensure complete studies for many
less frequent forms of cancer.
• Standardization and uniform quality measures across
studies will enable the merging of datasets, increasing
power to detect additional targets.
• For many tumor types, the spectrum of cancers varies
across the world because of environmental,
genetic and other causes.
• The ICGC will accelerate the dissemination of genomic
and analytical methods across participating sites and
the user community.
44
International Cancer Genome Consortium
(ICGC) Goals
• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe
• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors
• Make the data available to research community rapidly with minimal restrictions to accelerate research into the causes and control of cancer
50 tumor types and/or subtypes
500 tumors + 500 controls per subtype
50,000 Human Genome Projects!
Nature (2010) 464:993
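The "50,000 Human Genome Projects" figure follows directly from the two numbers above it. A short worked check, with constant names chosen here for illustration:

```python
# Figures taken from the slide.
TUMOUR_TYPES = 50          # tumor types and/or subtypes
TUMOURS_PER_TYPE = 500     # tumor genomes per subtype
CONTROLS_PER_TYPE = 500    # matched control genomes per subtype

genomes_per_type = TUMOURS_PER_TYPE + CONTROLS_PER_TYPE
total_genomes = TUMOUR_TYPES * genomes_per_type

print(genomes_per_type)  # 1000
print(total_genomes)     # 50000
```

Each subtype contributes 1,000 genomes (500 tumors plus 500 controls), so 50 subtypes give 50,000 genomes in total.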
45
ICGC
Goals, Structure,
Policies & Guidelines
http://goo.gl/sPGLQN
46
Primary Goal: coordinate efforts to
reach goals (50 tumours)
47
http://docs.icgc.org/dcc-data-element-specifications
48
Primary Goal: be comprehensive
http://goo.gl/BE7KH1
49
Analysis Data Types
• Germline variants (SNPs)
• Simple Somatic Mutations (SSM)
• Copy Number Alterations (CNA)
• Structural Variants (SV)
• Gene Expression (micro-arrays and RNASeq)
• miRNA Expression (RNASeq)
• Epigenomics (Arrays and Methylation)
• Splicing Variation (RNASeq)
• Protein Expression (Arrays)
50
Primary Goal: generate highest quality
http://goo.gl/FXCvi9
51
52
Primary Goal: available to all
53
Primary Goal: available to all
54
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
Region of residence
Risk factors
Examination
Surgery
Radiation
Sample
Slide
Specific histological features
Analyte
Aliquot
Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC OA (Open Access) Datasets
• Cancer Pathology
Histologic type or subtype
Histologic nuclear grade
• Patient/Person
Gender, Age range,
Vital status, Survival time
Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and Loss of Heterozygosity
• Newly discovered somatic variants
http://goo.gl/w4mrV
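The split between open access and controlled access datasets on this slide can be pictured as a simple lookup. The sketch below is illustrative only: the category strings are copied from the slide, but the `Access` enum, the `ACCESS_TIER` table and the `requires_authorization` helper are hypothetical names, not part of any ICGC software. Defaulting unknown categories to controlled access is a conservative choice assumed here, not a documented ICGC rule.

```python
from enum import Enum


class Access(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


# A few categories from the slide, keyed in lower case (hypothetical table).
ACCESS_TIER = {
    "histologic type or subtype": Access.OPEN,
    "gene expression (normalized)": Access.OPEN,
    "dna methylation": Access.OPEN,
    "newly discovered somatic variants": Access.OPEN,
    "gene expression (probe-level data)": Access.CONTROLLED,
    "raw genotype calls": Access.CONTROLLED,
    "gene-sample identifier links": Access.CONTROLLED,
    "genome sequence files": Access.CONTROLLED,
}


def requires_authorization(category: str) -> bool:
    """True when the category is controlled access (unknown categories included)."""
    tier = ACCESS_TIER.get(category.strip().lower(), Access.CONTROLLED)
    return tier is Access.CONTROLLED


print(requires_authorization("DNA methylation"))        # False
print(requires_authorization("Genome sequence files"))  # True
```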
55
Secondary Goal: coordinate
work to benefit productivity
http://goo.gl/K5mHC3
56
https://icgc.org/icgc/committees-and-working-groups
57
Secondary Goal: disseminate knowledge
http://goo.gl/ObcZXy
58
ICGC
Goals, Structure,
Policies & Guidelines
http://goo.gl/sPGLQN
59
Policy
ICGC membership implies compliance with Core
Bioethical Elements for samples used in ICGC
Cancer Projects:
http://goo.gl/TFrCmK
http://goo.gl/nYx6YG
60
POLICY:
The members of the International Cancer Genome
Consortium (ICGC) are committed to the principle of
rapid data release to the scientific community.
http://goo.gl/TFrCmK
61
Publication Policy
• The individual research groups in
the ICGC are free to publish the
results of their own efforts in
independent publications at any
time (subject, of course, to any
policies of any collaborations in
which they may be participating).
62
Moratorium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
63
Publication Policy
64
Where do you find that information?
• We actually make it hard to find, but we are
working on that! (this is an example of where ICGC
would like to do what TCGA does!)
• http://cancergenome.nih.gov/publications/publicationguidelines
65
Where do you find that information?
For ICGC data:
• Need to find the policy!
• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
• Find text:
• Find date: in README on FTP file
• This is bad, we know it, and we are fixing it!
• In doubt, contact us: [email protected]
66
Policy on Intellectual Property
• All ICGC members agree not to make claims to
possible IP derived from primary data (including
somatic mutations) and to not pursue IP
protections that would prevent or block access to
or use of any element of ICGC data or conclusions
drawn directly from those data.
http://goo.gl/TCMXCl
67
ICGC Map – May 2014
72 projects launched
68
DCC Activities
DCC activities are split between two groups:
• Software Development
– DCC portal
– Submission tool
• Biocuration (which also includes Content
Management)
– Data level management
– Submitter “handling”
– Coordination with secretariat
– User support
http://dcc.icgc.org/team
69
[Diagram: Data → Validation (dictionary) → Validation (across fields) → indexing → Happy Users]
http://goo.gl/1EcyR
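The slide names two kinds of validation applied to submitted data before indexing: checks against a dictionary and checks across fields. A minimal sketch of that two-stage idea follows. Everything in it is an assumption for illustration: the field names, the allowed values and the single cross-field rule are hypothetical and are not taken from the ICGC data dictionary or the DCC submission tool. Running the dictionary stage first is also assumed from the order of the labels on the slide.

```python
# Hypothetical controlled vocabularies (stage 1: dictionary validation).
DICTIONARY = {
    "donor_sex": {"male", "female"},
    "donor_vital_status": {"alive", "deceased"},
}


def validate_dictionary(record: dict) -> list[str]:
    """Check that each coded field holds one of its allowed values."""
    errors = []
    for field, allowed in DICTIONARY.items():
        value = record.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


def validate_across_fields(record: dict) -> list[str]:
    """Check that related fields are consistent with each other (stage 2)."""
    errors = []
    diagnosis = record.get("donor_age_at_diagnosis")
    followup = record.get("donor_age_at_last_followup")
    if diagnosis is not None and followup is not None and followup < diagnosis:
        errors.append("donor_age_at_last_followup is lower than donor_age_at_diagnosis")
    return errors


def validate(record: dict) -> list[str]:
    """Run dictionary checks first; run cross-field checks only if they pass."""
    errors = validate_dictionary(record)
    if not errors:
        errors = validate_across_fields(record)
    return errors


good = {"donor_sex": "female", "donor_vital_status": "alive",
        "donor_age_at_diagnosis": 61, "donor_age_at_last_followup": 64}
bad = dict(good, donor_age_at_last_followup=55)

print(validate(good))  # []
print(validate(bad))
```

Only records that come back with an empty error list would move on to the indexing step in a pipeline shaped like this one.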
70
http://docs.icgc.org/methods
71
http://docs.icgc.org/dcc-data-element-specifications
72
ICGC Biocuration
• Helping submitters get their data to ICGC
• Progress reporting (data audit)
• Quality checks (coverage, correctness, etc.)
• Helping users get to the data
• Validate and check (and recheck) metadata on public
repositories
• Test and integrate with other public repositories via
standard data formats, ontologies.
• Documentation, documentation, and more documentation
• Training
73
ICGC datasets to date
[Figure: ICGC Data Portal Cumulative Donor Count for Member Projects, Release 7 through Release 17; y-axis: Number of Donors, 0 to 14,000]
ICGC dataset version 17 (Sept 11th 2014)
• Cancer types: 50
• Body sites: 18
• Donors: 12,232
• Specimens: 24,661
• Simple somatic mutations: 9,871,477
• Mutated genes: 57,526
75
Clinical Data Completeness
[Figure: Overall Donor Clinical Data Completeness; average percentage completeness for each donor field]
Donor fields:
• Donor interval of last followup
• Donor Tumour stage at diagnosis
• Donor Tumour staging system at diagnosis
• Donor diagnosis ICD10
• Donor survival time
• Donor Tumour stage at diagnosis supplemental
• Donor relapse interval
• Donor age at last followup
• Donor relapse type
• Donor age at diagnosis
• Disease status last followup
• Donor region of residence
• Donor sex
• Donor ID
• Donor vital status
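The chart on this slide reports an average percentage completeness for each donor field. One plausible way to compute such a figure is sketched below: for each field, the share of donor records in which the field is filled in. The field names and sample records are invented for illustration, and the DCC's actual calculation (for example, how it averages across projects) may differ.

```python
def field_completeness(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Percentage of records in which each field has a non-empty value."""
    total = len(records)
    completeness = {}
    for field in fields:
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        completeness[field] = 100.0 * filled / total if total else 0.0
    return completeness


# Hypothetical donor records; None and "" both count as missing.
donors = [
    {"donor_id": "D1", "donor_sex": "female", "donor_survival_time": 420},
    {"donor_id": "D2", "donor_sex": "male", "donor_survival_time": None},
    {"donor_id": "D3", "donor_sex": "", "donor_survival_time": None},
    {"donor_id": "D4", "donor_sex": "male"},
]

result = field_completeness(donors, ["donor_id", "donor_sex", "donor_survival_time"])
print(result)
# {'donor_id': 100.0, 'donor_sex': 75.0, 'donor_survival_time': 25.0}
```

Identifier fields would sit at 100% in a calculation like this because a record cannot exist without them, while optional follow-up fields can fall well below that.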
76
77
Clinical Data Completeness
[Figure: Overall Specimen Clinical Data Completeness; average percentage completeness (0 to 100) for each specimen field]
Specimen fields:
• Level of cellularity
• Percentage cellularity
• Digital Image of Stained Section
• Tumour Stage Supplemental
• Tumour Stage
• Tumour Stage System
• Tumour Grade Supplemental
• Tumour Grade
• Tumour Grading System
• Tumour Histological Type
• Specimen available
• Specimen Biobank ID
• Specimen Biobank
• Tumour confirmed
• Specimen storage other
• Specimen storage
• Specimen processing other
• Specimen processing
• Specimen donor treatment type
• Specimen Interval
• Specimen type
• Specimen type other
• Specimen ID
• Donor ID
• Specimen donor treatment type other
78
79
[Diagram of data repositories and access routes. Labels: DACO; ICGC; TCGA; cgHub; EGA; ERA; BAM; Germline + EGA ID; ICGC BAM/FASTQ; TCGA BAM/FASTQ; ICGC Open Data (includes TCGA Open Data); COSMIC Open Data]
81
Raw Data Availability at EGA by Project and Data Type
• https://www.ebi.ac.uk/ega/organisations/EGAO00000000024
82
83
84
85
Select “Bladder Cancer – China”
86
Select “Pancreatic cancer – Canada”
87
… But where is the data?
88
89
http://dcc.icgc.org/
90
91
92
Highlights of the new portal: dcc.icgc.org
• Faceted search capabilities for variants, genes and
donors
– Fast and easy interactive data exploration
• Mutation aggregation & counts across donors and cancers
– e.g., number of pancreatic cancer donors with the KRAS G12D mutation
• Standardized gene consequence across all projects
• Genome browser
• Data download
• Protein domains
• Links to repositories
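To make "mutation aggregation & counts across donors and cancers" concrete, here is a minimal sketch of the counting step: given one row per observed mutation, count the distinct donors carrying each mutation within each project. The project codes, donor IDs and rows are invented for illustration, and the function is not the portal's implementation or API.

```python
from collections import defaultdict


def donors_per_mutation(observations):
    """Count distinct donors per (project, gene, protein change).

    observations: iterable of (donor_id, project, gene, protein_change) tuples.
    """
    donors = defaultdict(set)
    for donor_id, project, gene, change in observations:
        # A set ensures a donor seen in several specimens is counted once.
        donors[(project, gene, change)].add(donor_id)
    return {key: len(donor_ids) for key, donor_ids in donors.items()}


# Hypothetical observations; "PANC-X" and "BLADDER-Y" are made-up project codes.
rows = [
    ("D1", "PANC-X", "KRAS", "G12D"),
    ("D1", "PANC-X", "KRAS", "G12D"),   # same donor, second specimen
    ("D2", "PANC-X", "KRAS", "G12D"),
    ("D3", "PANC-X", "KRAS", "G12V"),
    ("D4", "BLADDER-Y", "KRAS", "G12D"),
]

counts = donors_per_mutation(rows)
print(counts[("PANC-X", "KRAS", "G12D")])  # 2
```

Keying on donor rather than on observation is the design point: the question on the slide is how many donors carry KRAS G12D, not how many times it was reported.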
93
KRAS search
94
• Summary
• Cancer type distribution
• Other links (Cosmic, Entrez, etc)
• Mutation profile in protein
• Domains
• Genomic Context
• Mutation profile
• Most common mutations
95
http://dcc.icgc.org/genes/ENSG00000133703
96
97
98
99
Donor
• Donor ID
• Primary site
• Cancer Project
• Gender
• Tumor Stage
• Vital Status
• Disease Status
• Release type
• Age at diagnosis
• Available data types
• Analysis types
100
Genes
101
Mutations
• Consequences
• Type
• Platform
• Verification status
102
Exporting data
103
Exporting data
104
105
Exporting data
106
Can do bulk download of the data …
107
[Diagram: BIG DATA, showing raw data, metadata and interpreted data, with validation steps and check marks (✔)]
108
[Diagram of data repositories and access routes. Labels: DACO; ICGC; TCGA; dbGaP; EGA; ERA; BAM; Open; Germline + EGA ID]
109
ICGC Data Categories
ICGC Open Access Datasets
• Cancer Pathology
Histologic type or subtype
Histologic nuclear grade
• Donor
Gender
Age range
• RNA expression (normalized)
• DNA methylation
• Genotype frequencies
• Somatic mutations (SNV, CNV and Structural Rearrangement)
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome Data
Patient demography
Risk factors
Examination
Surgery/Drugs/Radiation
Sample/Slide
Specific histological features
Protocol
Analyte/Aliquot
• Gene Expression (probe-level data)
• Raw genotype calls (germline)
• Gene-sample identifier links
• Genome sequence files
Most of the data in the portal is publicly available without restriction. However,
access to some data, such as germline mutations, requires authorization by the Data
Access Compliance Office (DACO).
http://icgc.org/daco
112
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
Region of residence
Risk factors
Examination
Surgery
Radiation
Sample
Slide
Specific histological features
Analyte
Aliquot
Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC OA (Open Access) Datasets
• Cancer Pathology
Histologic type or subtype
Histologic nuclear grade
• Patient/Person
Gender, Age range,
Vital status, Survival time
Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and Loss of Heterozygosity
• Newly discovered somatic variants
http://goo.gl/w4mrV
Identify yourself
Fill out the detailed form, which includes:
• Contact and Project Information
• Information Technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print and have your institution sign on your behalf.
DACO approved projects:
> 160 groups - 75% academic
(> 870 people)
121
Nature 409:452
Bioinformatics Citizenship: What it means,
and what does it cost?
122
Important messages:
• The ICGC portal is evolving and getting better all
the time
• Lots of data provided by the ICGC
• Important to be good citizens of the scientific world
• The idea behind all of this is to provide tools to
help cure cancer
• Need to respect policies and guidelines
• There is help out there, and user feedback is
*always* welcome.
123
Acknowledgments
DCC Software Developer
Vincent Ferretti
Daniel Chang
Anthony Cros
Jerry Lam
Brian O'Connor
Bob Tiernay
Stuart Watt
Shane Wilson
Junjun Zhang
ICGC Project leaders at the OICR:
Tom Hudson
John McPherson
Lincoln Stein
Jared Simpson
Paul Boutros
Vincent Ferretti
Francis Ouellette
Jennifer Jennings
Ouellette Lab
Michelle Brazas
Emilie Chautard
Nina Palikuca
Zhibin Lu
Web Dev
Joseph Yamada
Angela Chao
Daniel Gross
Kamen Wu
Kim Cullion
Miyuki Fukuma
Wen Xu
Pipeline Development
& Evaluation
Morgan Taschuk
Michael Laszloffy
Peter Ruzanov
ICGC DCC Biocuration
Hardeep Nahal
Marc Perry
http://oicr.on.ca http://icgc.org
… and all the patients and their
families who are putting their
hopes into our work!
Research IT/Systems
David Sutton
Bob Gibson
Sam Maclennan
David Magda
Rob Naccarato
Brian Ott
Gino Yearwood
EGA
Justin Paschall
Jeff Almeida-King
Ilkka Lappalainen
Jordi Rambla De Argila
Marc Sitges Puy
124
Informatics and Biocomputing at the OICR
125
http://oicr.on.ca/careers
126
127
http://icgc.org
http://dcc.icgc.org
http://docs.icgc.org
@bffo
Video tutorial: https://vimeo.com/75522669