Cancer Sequencing Quality & The ICGC-TCGA DREAM Somatic Mutation Calling Challenge. Dr. Paul C. Boutros, Ontario Institute for Cancer Research. July 14, 2015

1

Cancer Sequencing Quality & The ICGC-TCGA DREAM

Somatic Mutation Calling Challenge

Dr. Paul C. Boutros

Ontario Institute for Cancer Research

July 14, 2015

2

The Consequences of Analytical Diversity

SMC-DNA: Challenge-Based Benchmarking

SMC-Het & SMC-RNA

Pathway

3

General Plan for Data-Analysis

Proteomics, genomics, metabolomics…

Data-Analysis Results

4

Different Analysis; Same Conclusions?

Proteomics, genomics, metabolomics…

Data-Analysis Results

5

Biomarker for NSCLC: 24 Methods

Starmans et al. Genome Med 2013

6

Agreement: 151/442 Patients
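
As a rough illustration of the agreement figure above (151 of 442 patients, about 34%), the sketch below counts the patients for whom every method returns the same label. It assumes that "agreement" means unanimity across methods; the function name and the toy labels are hypothetical and are not taken from Starmans et al.

```python
def unanimous_agreement(predictions):
    """Count patients whose labels are identical across all methods.

    predictions maps a patient ID to a list of labels, one per method.
    Returns (number of unanimous patients, total number of patients).
    """
    unanimous = sum(
        1 for labels in predictions.values() if len(set(labels)) == 1
    )
    return unanimous, len(predictions)


# Toy data (hypothetical): three methods classify four patients.
toy = {
    "P1": ["good", "good", "good"],
    "P2": ["good", "poor", "good"],
    "P3": ["poor", "poor", "poor"],
    "P4": ["poor", "good", "good"],
}
agree, total = unanimous_agreement(toy)
toy_fraction = agree / total      # 2 of the 4 toy patients
slide_fraction = 151 / 442        # the figure quoted on the slide
```

With 24 methods instead of three, a single dissenting method is enough to remove a patient from the unanimous set, which is one reason such a fraction can be low.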

7

This holds for all tumour-types: breast cancer

Fox et al. 2014

74% of genes in 1+

16% of genes in 16+

0% of genes in 21+

8

Why Do We Need Improved Mutation Detection?

SNVs SVs

Singer Ma (UCSC)

9

The Consequences of Analytical Diversity

SMC-DNA: Challenge-Based Benchmarking

SMC-Het & SMC-RNA

Pathway

10

Why Do We Need Improved Mutation Detection?

SNVs SVs

Singer Ma (UCSC)

11

DREAM Mutation-Calling Challenge

Real data:
● 10 T/N pairs (50x/30x)
● Two tumour-types:
  ● 5 pancreatic
  ● 5 prostate
● Lane-level FASTQs & BAMs

In silico data:
● 5 T/N pairs
● For “play” and dry-runs
● Releases of increasing complexity
● Rapid scoring turn-around
● BAMs (Novoalign or BWA)

[Timeline figure: Nov 2013 to Aug 2014 with monthly ticks Dec through Jul, then Oct and Nov; phases labelled Competition, Validation, Winner; markers 1 to 5]

12

How Can You Get The Data?

Register for the Challenge at Synapse

Complete an ICGC DACO Application

Download using Annai’s GeneTorrent: no-cost to download

Directly access in the Google Compute Engine (Google cloud): $2,000 free computing

13

Initial Results

• So far:
  o 391 registrants
  o 3,260 entries on 14 genomes

• On-going post-challenge submissions as people try to understand the failures of their algorithms (a living benchmark!)

• Key discussions on scoring SVs and on improving BamSurgeon (the simulator)

14

Sample Per-Tumour Summaries

100% Cellular Tumours 80% Cellular Tumours

15

No Evidence of Over-Fitting (1/2)

16

No Evidence of Over-Fitting (2/2)

17

Coding Regions Had Lower Error Rates

18

But Recurrent Errors in Coding Regions

19

Wisdom of the Crowd (Per Tumour)
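
One simple way to picture a "wisdom of the crowd" ensemble is a majority vote over the call sets of several callers. The sketch below is only an illustration under that assumption; the slide does not state how the challenge ensembles were built, and the function name, the variant tuples and the three toy callers are hypothetical.

```python
from collections import Counter


def majority_vote(call_sets, min_votes=None):
    """Return the variants reported by at least min_votes callers.

    call_sets is a list of sets, one per caller. Each variant is a tuple
    (chromosome, position, ref, alt). By default a strict majority of
    callers is required.
    """
    if min_votes is None:
        min_votes = len(call_sets) // 2 + 1
    votes = Counter(variant for calls in call_sets for variant in set(calls))
    return {variant for variant, n in votes.items() if n >= min_votes}


# Toy call sets (hypothetical) for one tumour.
caller_a = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C")}
caller_b = {("chr1", 100, "A", "T"), ("chr2", 75, "C", "T")}
caller_c = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C")}

consensus = majority_vote([caller_a, caller_b, caller_c])
```

Here the variant seen by only one caller is dropped, while the two variants seen by at least two of the three callers are kept. Raising `min_votes` trades recall for precision.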

20

BUT Parameterization is Critical

21

Tuning Improvements Hold Across Tumours

Are we thinking about tumour variant calling wrong?

22

Sequence Context: Trinucleotides
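
For readers unfamiliar with the term, the trinucleotide context of an SNV is the mutated base together with its 5' and 3' neighbours. The sketch below extracts it from a reference string and, as an assumption borrowed from common mutational-signature practice rather than from the slide, reports it relative to the pyrimidine (C or T) strand. The function names and the toy reference sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G and T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def trinucleotide_context(reference, pos, alt):
    """Describe an SNV as 5'-neighbour[ref>alt]3'-neighbour.

    reference is the reference sequence, pos is the 0-based position of
    the mutated base and alt is the alternate allele. If the reference
    base is a purine (A or G), the context is reported on the opposite
    strand so that the mutated base is always C or T.
    """
    if pos < 1 or pos > len(reference) - 2:
        raise ValueError("position needs a neighbour on both sides")
    tri = reference[pos - 1:pos + 2].upper()
    alt = alt.upper()
    if tri[1] in "AG":
        tri = reverse_complement(tri)
        alt = COMPLEMENT[alt]
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


# C>T at the C in ...ACG... stays on the given strand.
context_pyrimidine = trinucleotide_context("TTACGTT", 3, "T")
# G>A at the G in ...AGC... is reported as C>T on the opposite strand.
context_purine = trinucleotide_context("TTAGCTT", 3, "A")
```

Tabulating caller errors by such contexts is one way to check whether false positives or false negatives concentrate in particular sequence neighbourhoods.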

23

Where are we now?

Initial SNV analysis complete: Ewing et al., in press, Nature Methods

Initial SV analysis (of in silico tumours) in progress No-cost to download

Experimental validation studies nearly complete

24

The Consequences of Analytical Diversity

SMC-DNA: Challenge-Based Benchmarking

SMC-Het & SMC-RNA

Pathway

25

So, What About Heterogeneity?

As part of TCGA-Prostate we were looking at normal cell contamination

We = Svitlana Tyekucheva, Syed Haider, Massimo Loda, Francesca Demichelis

We’d just take a consensus of estimators….

26

Exactly As Expected….

27

Opening for registration on November 10, 2014

Opening for submissions in August 2015 (ahem!)

https://www.synapse.org/#!Synapse:syn2813581

Lcchong, wikipedia

28

SMC Tumour Heterogeneity Challenge

Single Sample
• 50 samples
• Simulated from GIAB and a deep-sequenced normal
• Cloud-only (GCE+Galaxy) REB, distribution
• Varying complexity, mutational load, depth, etc.
• ~3 months run-time

Multi-Sample
• Sample number pending
• Similar design, though
• Cloud-based (Galaxy)
• Similar parameter ranges
• 3 months

29

BAMSurgeon Overview

30

Validating BAMSurgeon

Changing Aligner Changing Cell-Line

31

How Are We Going to Simulate?

Start with a chr-BAM

Phase and create two ph-chr-BAMs

Extract reads for normal & contamination

Spike SNVs, CNAs, GRs

[Figure labels: Phase A, Phase B, Contaminating Normal, Sub-clone A, Sub-clone B]
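
To make the effect of mixing normal reads and sub-clones concrete, the sketch below computes the variant allele fraction one would expect for a spiked SNV. It is a back-of-the-envelope model, not BAMSurgeon's implementation: it assumes a heterozygous SNV in a diploid, copy-neutral region, and the function name, parameter names and example proportions are hypothetical.

```python
def expected_vaf(normal_fraction, clone_fractions, carrier_clones):
    """Expected variant allele fraction of a heterozygous SNV.

    normal_fraction: fraction of all cells that are contaminating normal.
    clone_fractions: dict mapping sub-clone name to its fraction of the
        tumour cells (the values must sum to 1).
    carrier_clones: names of the sub-clones that carry the SNV.

    Assumes a diploid, copy-neutral locus, so each carrier cell
    contributes one mutant allele out of two.
    """
    if not 0.0 <= normal_fraction < 1.0:
        raise ValueError("normal_fraction must be in [0, 1)")
    if abs(sum(clone_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("clone fractions must sum to 1")
    carrier_share = sum(clone_fractions[name] for name in carrier_clones)
    return (1.0 - normal_fraction) * carrier_share / 2.0


# Hypothetical design: 20% contaminating normal, sub-clones A and B.
clones = {"A": 0.6, "B": 0.4}
vaf_shared = expected_vaf(0.2, clones, ["A", "B"])   # SNV in both sub-clones
vaf_b_only = expected_vaf(0.2, clones, ["B"])        # SNV private to B
```

Under these assumptions the shared SNV is expected at a VAF of about 0.40 and the SNV private to sub-clone B at about 0.16, which illustrates why low-prevalence sub-clonal variants demand more depth to detect.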

32

How Are We Going to Simulate?

Final BAM

SNV Calls: MuTect, Strelka

CNA Calls: Battenberg, TITAN

Available via Google Cloud / Docker API

33

Draft of Tumour Design (Not Final!)

34

What Are We Scoring?

1. Sub-population characteristics
   a) What is the level of normal “contamination”?
   b) How many sub-populations are present?
   c) What are their proportions?

2. What is the phylogenetic order of sub-populations?

3. For each mutation, which sub-populations is it in?
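
The three scoring questions can be pictured as fields of a single answer record. The sketch below is a hypothetical data structure for illustration only; it is not the challenge's submission format, and the class name, field names and example values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class HeterogeneityAnswer:
    """Hypothetical container for one tumour's answers."""
    normal_fraction: float   # 1a: level of normal contamination
    proportions: dict        # 1b, 1c: sub-population -> proportion
    parent: dict             # 2: sub-population -> parent (None for root)
    origin: dict             # 3: mutation -> sub-population where it arose

    def n_subpopulations(self):
        """Question 1b: how many sub-populations are present."""
        return len(self.proportions)

    def carriers(self, mutation):
        """Question 3: sub-populations containing the mutation.

        A mutation is assumed to be present in the sub-population where
        it arose and in every descendant of that sub-population.
        """
        start = self.origin[mutation]
        found = set()
        for population in self.proportions:
            node = population
            while node is not None:       # walk up towards the root
                if node == start:
                    found.add(population)
                    break
                node = self.parent[node]
        return found


# Toy answer (hypothetical): sub-clone B descends from sub-clone A.
answer = HeterogeneityAnswer(
    normal_fraction=0.2,
    proportions={"A": 0.6, "B": 0.4},
    parent={"A": None, "B": "A"},
    origin={"snv1": "A", "snv2": "B"},
)
shared = answer.carriers("snv1")
private = answer.carriers("snv2")
```

In this toy tree, `snv1` arose in A and is therefore also present in its descendant B, while `snv2` is private to B.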

35

The Consequences of Analytical Diversity

SMC-DNA: Challenge-Based Benchmarking

SMC-Het & SMC-RNA

Pathway

36

CPC-GENE: The People Involved

Dr. Robert Bristow, Dr. John McPherson, Dr. Theodore van der Kwast

Boutros Lab: Richard de Borja, Nicholas Harding, Pablo Hennings-Yeomans, Emilie Lalonde, Amin Zia, Jianxin Wang, Francis Nguyen, Natalie Fox, Michelle Chan-Seng-Yue, Lauren Chong, Takafumi Yamaguchi, Veronica Sabelnykova

Informatics: Timothy Beck, Fouad Yousif, Robert Denroche, Xuemei Luo

Genomics: Taryne Chong, Andrew Brown, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Ada Wong

Clinico-Molecular: Dominique Trudel, Alice Meng, Gaetano Zafarana

PIs & PMs: Michael Fraser, Melania Pintilie, Neil Fleshner, Lakshmi Muthuswamy, Colin Collins, Thomas Hudson, Lincoln Stein

37

SMC-DNA Organizing Team

Sage/DREAM Organizers

Gustavo Stolovitzky

Stephen Friend

Adam Margolin

Thea Norman

Christine Suver

Christopher Bare

Kristen Dang

Bruce Hoff

Mike Kellen

External Organizers

Paul Boutros (OICR)

Josh Stuart (UCSC)

Lincoln Stein (OICR)

Kyle Ellrott (UCSC)

Adam Ewing (UCSC)

Anna Lee (OICR)

Katie Houlahan (OICR)

Cristian Caloian (OICR)

Takafumi Yamaguchi (OICR)

Data Contributors: Funding/Sponsoring/Publication Partners Include:

38

Organizers

• Paul Boutros (OICR)

• Josh Stuart (UCSC)

• Gustavo Stolovitzky (IBM)

• Stephen Friend (Sage)

• David Wedge (Sanger)

• Peter Van Loo (UCL)

• Quaid Morris (University of Toronto)

• Thea Norman (Sage)

Data Contributors Funding/Sponsoring/Publication Partners Include:

• Amit Deshwar (University of Toronto)

• Minjeong Ko (OICR)

• Kyle Ellrott (UCSC)

• Christopher Bare (Sage)

• Kristen Dang (Sage)

• Yin Hu (Sage)

• Shannon Carter (Sage)

SMC-Het Organizing Team