Nov 2014 ouellette_windsor_icgc_final


Presentation to the Department of Biology at the University of Windsor, Windsor, Ontario: a description and update of activities related to the International Cancer Genome Consortium (ICGC).

A project status for the International Cancer Genome Consortium (ICGC).

November 21st, 2014

B.F. Francis Ouellette francis@oicr.on.ca

• Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON

• Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON.

@bffo on

2

You are free to:

Copy, share, adapt, or re-mix;

Photograph, film, or broadcast;

Blog, live-blog, or post video of

this presentation, provided that:

You attribute the work to its author and respect the rights and licenses associated with its components.

Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.

Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:

http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

3

But first, a little about me …

… an unfinished story!

4

http://goo.gl/v8G57

5

http://goo.gl/WKLNr

http://goo.gl/dJIur

http://goo.gl/LwVOZ

http://goo.gl/kTGkG

http://goo.gl/sO5Na

http://goo.gl/LwVOZ

http://goo.gl/QI6aL

http://goo.gl/mYHFO

http://goo.gl/Jc5TK

15

(from the National Centre for Biotechnology Information)

from the National Centre for Biotechnology Information

16

from the National Centre for Biotechnology Information

17

from the National Centre for Biotechnology Information

PANIC

18

19

PANIC

20

PANIC

1998

2001

2004

22

1999

30

http://goo.gl/ZKzMV

31

http://goo.gl/ZKzMV

32

33

International Cancer Genome Consortium: icgc.org

34

http://www.csb.utoronto.ca/

35

http://bioinformatics.ca/

36

37

http://bioinformatics.ca/workshops/2014

38

E-mail: course_info@bioinformatics.ca

Web: http://bioinformatics.ca

39

Cancer: A Disease of the Genome

Challenge in Treating Cancer:

Every tumor is different

Every cancer patient is different

40

Large-Scale Studies of Cancer Genomes

Johns Hopkins

> 18,000 genes analyzed for mutations

11 breast and 11 colon tumors

L.D. Wood et al., Science, Oct. 2007

Wellcome Trust Sanger Institute

518 genes analyzed for mutations

210 tumors of various types

C. Greenman et al., Nature, Mar. 2007

TCGA (NIH)

Multiple technologies

Brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma)

F.S. Collins & A.D. Barker, Sci. Am., Mar. 2007

41

Lessons learned

Heterogeneity within and across tumor types

High rate of abnormalities (driver vs passenger)

Sample quality matters

Consent and controlled data access are complicated

42

International Cancer Genome Consortium

• Collect ~500 tumour/normal pairs from each of 50 different major cancer types;

• Comprehensive genome analysis of each T/N pair:

– Genome

– Transcriptome

– Methylome

– Clinical data

• Make the data available to the research community & public.

Identify genome changes

…GATTATTCCAGGTAT… …GATTATTGCAGGTAT… …GATTATTGCAGGTAT…
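The three sequence fragments above differ at a single base. A minimal sketch of spotting such a change by position-wise comparison follows. It assumes equal-length, already-aligned fragments and treats the first fragment as the reference, which the slide does not state; the function and variable names are illustrative only.

```python
def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length for position-wise comparison")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# The three fragments shown on the slide (ellipses dropped).
first = "GATTATTCCAGGTAT"
second = "GATTATTGCAGGTAT"
third = "GATTATTGCAGGTAT"

print(point_differences(first, second))   # one C-to-G change at position 8
print(point_differences(second, third))   # no differences
```

Real tumour/normal comparison works on aligned sequencing reads with dedicated variant callers; this only illustrates the idea of identifying a genome change.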

43

Rationale for the ICGC

• The scope is huge, such that no single country can do it all.

• Coordinated cancer genome initiatives will reduce duplication of effort for common and easy-to-acquire tumor samples and ensure complete studies for many less frequent forms of cancer.

• Standardization and uniform quality measures across studies will enable the merging of datasets, increasing power to detect additional targets.

• For many tumor types, the spectrum of cancers varies across the world because of environmental, genetic and other causes.

• The ICGC will accelerate the dissemination of genomic and analytical methods across participating sites and the user community.

44

International Cancer Genome Consortium (ICGC) Goals

• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe

• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors

• Make the data available to the research community rapidly, with minimal restrictions, to accelerate research into the causes and control of cancer

50 tumor types and/or subtypes

500 tumors + 500 controls per subtype

50,000 Human Genome Projects!

Nature (2010) 464:993
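The "50,000 Human Genome Projects" figure follows from the two numbers above it. A small worked check, assuming each tumour genome and each matched control genome counts as one genome (that reading is an interpretation of the slide, and the variable names are illustrative):

```python
tumor_types = 50          # tumor types and/or subtypes
tumors_per_type = 500     # tumour genomes per subtype
controls_per_type = 500   # matched control genomes per subtype

genomes_per_type = tumors_per_type + controls_per_type
total_genomes = tumor_types * genomes_per_type

print(genomes_per_type)   # 1000
print(total_genomes)      # 50000
```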

45

ICGC

Goals, Structure,

Policies & Guidelines

http://goo.gl/sPGLQN

46

Primary Goal: coordinate efforts to reach goals (50 tumours)

47

http://docs.icgc.org/dcc-data-element-specifications

48

Primary Goal: be comprehensive

http://goo.gl/BE7KH1

49

Analysis Data Types

• Germline variants (SNPs)

• Simple Somatic Mutations (SSM)

• Copy Number Alterations (CNA)

• Structural Variants (SV)

• Gene Expression (micro-arrays and RNASeq)

• miRNA Expression (RNASeq)

• Epigenomics (Arrays and Methylation)

• Splicing Variation (RNASeq)

• Protein Expression (Arrays)

50

Primary Goal: generate highest quality

http://goo.gl/FXCvi9

51

52

Primary Goal: available to all

53

Primary Goal: available to all

54

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome data

– Region of residence

– Risk factors

– Examination

– Surgery

– Radiation

– Sample

– Slide

– Specific histological features

– Analyte

– Aliquot

– Donor notes

• Gene Expression (probe-level data)

• Raw genotype calls

• Gene-sample identifier links

• Genome sequence files

ICGC OA Datasets

• Cancer Pathology

– Histologic type or subtype

– Histologic nuclear grade

• Patient/Person

– Gender, Age range

– Vital status, Survival time

– Relapse type, Status at follow-up

• Gene Expression (normalized)

• DNA methylation

• Computed Copy Number and Loss of Heterozygosity

• Newly discovered somatic variants

http://goo.gl/w4mrV

55

Secondary Goal: coordinate work to benefit productivity

http://goo.gl/K5mHC3

56

https://icgc.org/icgc/committees-and-working-groups

57

Secondary Goal: disseminate knowledge

http://goo.gl/ObcZXy

58

ICGC

Goals, Structure,

Policies & Guidelines

http://goo.gl/sPGLQN

59

Policy

ICGC membership implies compliance with Core Bioethical Elements for samples used in ICGC Cancer Projects:

http://goo.gl/TFrCmK

http://goo.gl/nYx6YG

60

POLICY:

The members of the International Cancer Genome Consortium (ICGC) are committed to the principle of rapid data release to the scientific community.

http://goo.gl/TFrCmK

61

Publication Policy

• The individual research groups in the ICGC are free to publish the results of their own efforts in independent publications at any time (subject, of course, to any policies of any collaborations in which they may be participating).

62

Moratorium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy

63

Publication Policy

64

Where do you find that information?

• We actually make it hard to find, but we are working on that! (This is an example of where ICGC would like to do what TCGA does!)

• http://cancergenome.nih.gov/publications/publicationguidelines

65

Where do you find that information?

For ICGC data:

• Need to find the policy!

• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy

• Find text:

• Find date: in README on FTP file

• This is bad, we know it, and we are fixing it!

• If in doubt, contact us: info@icgc.org

66

Policy on Intellectual Property

• All ICGC members agree not to make claims to possible IP derived from primary data (including somatic mutations) and to not pursue IP protections that would prevent or block access to or use of any element of ICGC data or conclusions drawn directly from those data.

http://goo.gl/TCMXCl

67

ICGC Map – May 2014

72 projects launched

68

DCC Activities

DCC activities are split between two groups:

• Software Development

– DCC portal

– Submission tool

• Biocuration (which also includes Content Management)

– Data level management

– Submitter “handling”

– Coordination with secretariat

– User support

http://dcc.icgc.org/team

69

[Diagram: Data; Validation (dictionary); Validation (across fields); indexing; Happy Users]

http://goo.gl/1EcyR
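The two validation stages named on this slide, dictionary validation and validation across fields, can be sketched as follows. This is an illustration only and not the DCC's actual submission code: the field names echo the donor fields listed elsewhere in this deck, while the allowed values and the cross-field rule are assumptions invented for the example.

```python
# Hypothetical mini-dictionary: field name -> allowed values (invented for illustration).
DICTIONARY = {
    "donor_sex": {"male", "female"},
    "donor_vital_status": {"alive", "deceased"},
}


def validate_dictionary(record: dict) -> list[str]:
    """Dictionary validation: each controlled field must hold an allowed value."""
    errors = []
    for field, allowed in DICTIONARY.items():
        value = record.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


def validate_across_fields(record: dict) -> list[str]:
    """Cross-field validation: related fields must be consistent with each other."""
    errors = []
    diagnosis = record.get("donor_age_at_diagnosis")
    followup = record.get("donor_age_at_last_followup")
    if diagnosis is not None and followup is not None and followup < diagnosis:
        errors.append("donor_age_at_last_followup is lower than donor_age_at_diagnosis")
    return errors


def validate(record: dict) -> list[str]:
    """Run both stages and return every problem found."""
    return validate_dictionary(record) + validate_across_fields(record)


good = {"donor_sex": "female", "donor_vital_status": "alive",
        "donor_age_at_diagnosis": 61, "donor_age_at_last_followup": 64}
bad = {"donor_sex": "unknown", "donor_vital_status": "alive",
       "donor_age_at_diagnosis": 60, "donor_age_at_last_followup": 58}

print(validate(good))  # no problems reported
for problem in validate(bad):
    print(problem)
```

Collecting all errors rather than stopping at the first one lets a submitter fix a whole file in one pass.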

70

http://docs.icgc.org/methods

71

http://docs.icgc.org/dcc-data-element-specifications

72

ICGC Biocuration

• Helping submitters get their data to ICGC

• Progress reporting (data audit)

• Quality checks (coverage, correctness, etc.)

• Helping users get to the data

• Validate and check (and recheck) metadata on public repositories

• Test and integrate with other public repositories via standard data formats, ontologies.

• Documentation, documentation, and more documentation

• Training

73

ICGC datasets to date

[Chart: ICGC Data Portal Cumulative Donor Count for Member Projects. y-axis: Number of Donors (0 to 14,000); x-axis: Release 7 to Release 17]

ICGC dataset version 17, Sept 11th 2014

• Cancer types: 50

• Body sites: 18

• Donors: 12,232

• Specimens: 24,661

• Simple somatic mutations: 9,871,477

• Mutated genes: 57,526

75

Clinical Data Completeness

[Bar chart: Overall Donor Clinical Data Completeness. Axis: Average Percentage Completeness. Donor fields shown: Donor interval of last followup, Donor Tumour stage at diagnosis, Donor Tumour staging system at diagnosis, Donor diagnosis ICD10, Donor survival time, Donor Tumour stage at diagnosis supplemental, Donor relapse interval, Donor age at last followup, Donor relapse type, Donor age at diagnosis, Disease status last followup, Donor region of residence, Donor sex, Donor ID, Donor vital status]
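The slide does not say how "percentage completeness" was computed. One plausible definition, sketched here as an assumption, is the share of records in which a field holds a non-empty value. The records below are toy data invented for illustration, and the function name is hypothetical.

```python
def percent_complete(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Percentage of records with a non-empty value, computed per field."""
    if not records:
        return {field: 0.0 for field in fields}
    result = {}
    for field in fields:
        # A value counts as missing if the key is absent, None, or an empty string.
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        result[field] = 100.0 * filled / len(records)
    return result


# Toy donor records, invented for illustration only.
donors = [
    {"donor_id": "D1", "donor_sex": "female", "donor_survival_time": 420},
    {"donor_id": "D2", "donor_sex": "male", "donor_survival_time": None},
    {"donor_id": "D3", "donor_sex": "", "donor_survival_time": None},
    {"donor_id": "D4", "donor_sex": "male"},
]

completeness = percent_complete(donors, ["donor_id", "donor_sex", "donor_survival_time"])
for field, percent in completeness.items():
    print(f"{field}: {percent:.0f}%")
```

Averaging these per-field percentages across projects would give a figure comparable to an "average percentage completeness", again under the assumption above.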

76

77

Clinical Data Completeness

[Bar chart: Overall Specimen Clinical Data Completeness. Axis: Average Percentage Completeness (0 to 100). Specimen fields shown: Level of cellularity, Percentage cellularity, Digital Image of Stained Section, Tumour Stage Supplemental, Tumour Stage, Tumour Stage System, Tumour Grade Supplemental, Tumour Grade, Tumour Grading System, Tumour Histological Type, Specimen available, Specimen Biobank ID, Specimen Biobank, Tumour confirmed, Specimen storage other, Specimen storage, Specimen processing other, Specimen processing, Specimen donor treatment type, Specimen Interval, Specimen type, Specimen type other, Specimen ID, Donor ID, Specimen donor treatment type other]

78

79

[Diagram: DACO, ICGC, cgHUB, EGA, ERA, TCGA, COSMIC. Labels: BAM, Open, Germ Line + EGA id, ICGC BAM/FASTQ, TCGA BAM/FASTQ, ICGC Open Data (includes TCGA Open Data), COSMIC Open Data]

81

Raw Data Availability at EGA by Project and Data Type

• https://www.ebi.ac.uk/ega/organisations/EGAO00000000024

82

83

84

85

Select “Bladder Cancer – China”

86

Select “Pancreatic cancer – Canada”

87

… But where is the data?

88

89

http://dcc.icgc.org/

90

91

92

Highlights of the new portal: dcc.icgc.org

• Faceted search capabilities for variants, genes and donors

– Fast and easy interactive data exploration

• Mutation aggregation & counts across donors and cancers

– e.g. number of pancreatic cancer donors with the KRAS G12D mutation

• Standardized gene consequence across all projects

• Genome browser

• Data download

• Protein domains

• Links to repositories
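The "mutation aggregation & counts across donors and cancers" highlight can be illustrated with a small counting sketch. It does not use the portal's API; the records are toy data invented for illustration, and the record layout and function name are assumptions.

```python
from collections import defaultdict

# Toy observations, invented for illustration only:
# (donor ID, cancer type, gene, protein change)
observations = [
    ("donor-1", "pancreatic", "KRAS", "G12D"),
    ("donor-1", "pancreatic", "KRAS", "G12D"),  # same mutation reported twice for one donor
    ("donor-2", "pancreatic", "KRAS", "G12D"),
    ("donor-3", "pancreatic", "KRAS", "G12V"),
    ("donor-4", "colorectal", "KRAS", "G12D"),
]


def donors_per_mutation(records):
    """Count distinct donors for each (cancer type, gene, protein change)."""
    donors = defaultdict(set)
    for donor_id, cancer_type, gene, change in records:
        donors[(cancer_type, gene, change)].add(donor_id)
    return {key: len(donor_ids) for key, donor_ids in donors.items()}


counts = donors_per_mutation(observations)
print(counts[("pancreatic", "KRAS", "G12D")])  # 2 distinct donors in the toy data
```

Counting distinct donor IDs rather than rows keeps a donor from being counted twice when the same mutation is reported from more than one specimen.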

93

KRAS search

94

• Summary

• Cancer type distribution

• Other links (Cosmic, Entrez, etc)

• Mutation profile in protein

• Domains

• Genomic Context

• Mutation profile

• Most common mutations

95

http://dcc.icgc.org/genes/ENSG00000133703

96

97

98

99

Donor

• Donor ID

• Primary site

• Cancer Project

• Gender

• Tumor Stage

• Vital Status

• Disease Status

• Release type

• Age at diagnosis

• Available data types

• Analysis types

100

Genes

101

Mutations

• Consequences

• Type

• Platform

• Verification status

102

Exporting data

103

Exporting data

104

105

Exporting data

106

Can do bulk download of the data …

107

[Diagram: BIG DATA; Validation; RAW DATA; MetaDATA; Interpreted data]

108

[Diagram: DACO, ICGC, dbGaP, EGA, ERA, TCGA. Labels: BAM, Open, Germ Line + EGA id]

109

ICGC Data Categories

ICGC Open Access Datasets

• Cancer Pathology

– Histologic type or subtype

– Histologic nuclear grade

• Donor

– Gender

– Age range

• RNA expression (normalized)

• DNA methylation

• Genotype frequencies

• Somatic mutations (SNV, CNV and Structural Rearrangement)

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome Data

– Patient demography

– Risk factors

– Examination

– Surgery/Drugs/Radiation

– Sample/Slide

– Specific histological features

– Protocol

– Analyte/Aliquot

• Gene Expression (probe-level data)

• Raw genotype calls (germline)

• Gene-sample identifier links

• Genome sequence files

Most of the data in the portal is publicly available without restriction. However, access to some data, like the germline mutations, requires authorization by the Data Access Compliance Office (DACO).

http://icgc.org/daco
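The open versus controlled split on this slide amounts to a lookup from data category to access tier. A minimal sketch, using the category names from the slide; the function name is hypothetical and this is not how the portal itself enforces access:

```python
# Category names taken from the slide; the grouping into two sets mirrors its two columns.
OPEN_ACCESS = {
    "Cancer Pathology",
    "Donor",
    "RNA expression (normalized)",
    "DNA methylation",
    "Genotype frequencies",
    "Somatic mutations",
}

CONTROLLED_ACCESS = {
    "Detailed Phenotype and Outcome Data",
    "Gene Expression (probe-level data)",
    "Raw genotype calls (germline)",
    "Gene-sample identifier links",
    "Genome sequence files",
}


def requires_daco_approval(category: str) -> bool:
    """Return True when a data category sits in the controlled access tier."""
    if category in CONTROLLED_ACCESS:
        return True
    if category in OPEN_ACCESS:
        return False
    raise KeyError(f"unknown data category: {category}")


print(requires_daco_approval("Raw genotype calls (germline)"))  # True
print(requires_daco_approval("DNA methylation"))                # False
```

Raising on an unknown category, instead of defaulting to open, is the cautious choice when the data may identify a person.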

112

Identify yourself

Fill out the detailed form, which includes:

• Contact and Project Information

• Information Technology details and procedures for keeping data secure

• Data Access Agreement

All of these documents are put into a PDF file that you print and have your institution sign on your behalf.

DACO approved projects:

> 160 groups - 75% academic

(> 870 people)

121

Nature 409:452

Bioinformatics Citizenship: What it means, and what does it cost?

122

Important messages:

• The ICGC portal is evolving and getting better all the time

• Lots of data provided by the ICGC

• Important to be good citizens of the scientific world

• The idea behind all of this is to provide tools to help cure cancer

• Need to respect policies and guidelines

• There is help out there, and user feedback is *always* welcome.

123

DCC Software

Developer

Vincent Ferretti

Daniel Chang

Anthony Cros

Jerry Lam

Brian O'Connor

Bob Tiernay

Stuart Watt

Shane Wilson

Junjun Zhang

Acknowledgments

ICGC Project leaders at the OICR:

Tom Hudson

John McPherson

Lincoln Stein

Jared Simpson

Paul Boutros

Vincent Ferretti

Francis Ouellette

Jennifer Jennings

Ouellette Lab

Michelle Brazas

Emilie Chautard

Nina Palikuca

Zhibin Lu

Web Dev

Joseph Yamada

Angela Chao

Daniel Gross

Kamen Wu

Kim Cullion

Miyuki Fukuma

Wen Xu

Pipeline Development & Evaluation

Morgan Taschuk

Michael Laszloffy

Peter Ruzanov

ICGC DCC Biocuration

Hardeep Nahal

Marc Perry

http://oicr.on.ca http://icgc.org

… and all the patients and their families who are putting their hopes into our work!

Research IT/Systems

David Sutton

Bob Gibson

Sam Maclennan

David Magda

Rob Naccarato

Brian Ott

Gino Yearwood

EGA

Justin Paschall

Jeff Almeida-King

Ilkka Lappalainen

Jordi Rambla De Argila

Marc Sitges Puy

124

Informatics and Biocomputing at the OICR

125

http://oicr.on.ca/careers

126

127

http://icgc.org

http://dcc.icgc.org

http://docs.icgc.org

info@icgc.org

@bffo

Video tutorial: https://vimeo.com/75522669