Project plan for generating a somatic data truth set for NGS cancer assay validation: COLO-829 and fusion spike-in materials
Stephanie J.K. Pond, 8/15/13
There is a need for development and widespread adoption of standards to facilitate tool development and assay validation for next-gen sequencing in cancer applications.
– Cancer standards are needed for somatic calls for SNVs, indels, structural variants, copy number variation, and RNA fusion detection.
There is limited publicly available data that can act as a “gold standard” dataset.
We embarked on a multi-lab collaboration to generate a set of somatic calls that can be used as a truth dataset for validating assays and evaluating their performance.
– In this initial work, we are excluding FFPE samples
Introduction
Cell lines have been previously sequenced and somatic calls from the DNA were published.
– Pleasance et al. Nature 2010, 463(7278): 191-196.
– Found variants in the major categories (SNVs, indels, CNVs, SVs) that need to be investigated for cancer applications
– Substitutions, insertions, deletions were confirmed by capillary sequencing
– Structural variants were confirmed by PCR across the breakpoint and capillary sequencing
– Confirmation was run in both cell lines to distinguish somatic from germline variants
We want to expand on this dataset.
COLO-829, COLO-829BL Cell Lines
Cancer type: Melanoma; malignant

Name          Tissue source    ATCC No.
COLO 829      skin             CRL-1974
COLO 829BL    B lymphoblast    CRL-1980
[Figure: Circos plot from the COSMIC database]
Whole genome sequencing of COLO-829 and COLO-829BL at a depth of 90x is being generated to build a set of consensus calls:
– TGen: HiSeq 2500, multiple variant callers, cell passage A
– TGen: samples sent to Complete Genomics for sequencing, to incorporate an orthogonal technology
– Illumina: HiSeq 2500, cell passage B
The consensus of the datasets will establish a set of somatic calls that can be used as a gold standard in analytical validations
– expand the set in the literature
– a second set of lower-confidence somatic calls (present in 2 of 3 datasets) may also be identified
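The consensus idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: it assumes each dataset has already been reduced to a set of variants keyed by (chromosome, position, ref, alt), and the function name `consensus_tiers`, the dataset labels, and the toy variants are all hypothetical.

```python
from collections import Counter


def consensus_tiers(callsets):
    """Split variants into two tiers by how many datasets report them.

    callsets maps a dataset label to an iterable of hashable variant keys,
    assumed here to be (chrom, pos, ref, alt) tuples.
    Returns (high, lower): variants seen in every dataset, and variants
    seen in all but one dataset (2 of 3 when three datasets are given).
    """
    n = len(callsets)
    counts = Counter()
    for calls in callsets.values():
        counts.update(set(calls))  # each dataset votes at most once per variant
    high = {v for v, c in counts.items() if c == n}
    lower = {v for v, c in counts.items() if c == n - 1}
    return high, lower


# Toy coordinates invented for illustration; these are not COLO-829 calls.
toy = {
    "tgen_hiseq": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"), ("chr3", 300, "A", "G")},
    "tgen_cg": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")},
    "ilmn_hiseq": {("chr1", 100, "C", "T"), ("chr4", 400, "T", "C")},
}
high, lower = consensus_tiers(toy)
print(sorted(high))
print(sorted(lower))
```

Exact key matching is the simplest choice; in practice indels and structural variants would likely need normalization or breakpoint tolerance before they can be compared across callers.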
Whole Genome Sequencing of COLO-829 and COLO-829BL
[Figure: consensus calls drawn from three datasets: TGen (Complete Genomics), TGen (HiSeq), ILMN (HiSeq)]
Synthetic Oligo Spike-In mRNA Transcripts
Construct layout: T7 – AscI – GeneA – GeneB – NotI – T3(rc)

ID       Genes           Transcript length (excluding poly A+)
TFG01    EWS-ATF1        1150
TFG02    TMPRSS2-ETV1    1282
TFG03    EWS-FLI1        1483
TFG04    NTRK3-ETV6      1954
TFG05    CD74-ROS1       2164
TFG06    HOOK3-RET       2383
TFG07    EML4-ALK        3442
TFG08    AKAP9-BRAF      4531
TFG09    BCR-ABL         N/A*
TFG10    BRD4-NUT        3969
*IDT could not synthesize TFG09 due to significant secondary structure
• Sequences for 9 clinically relevant gene fusions were pulled from GenBank and synthesized as DNA plasmids by IDT.
• In vitro transcription of the purified plasmid, followed by poly-A tailing, resulted in mRNA transcripts of known sequence.
• These constructs can be used as spike-in control materials in mRNA protocols to assess the ability to detect fusion genes, a critical mutation type in cancer.
A pool of fusion spikes was added to COLO-829 total RNA at different concentrations.
The data show a linear response at higher concentrations and poor detection below a threshold value.
One spike (TMPRSS2-ETV1) is not detected, even at the highest concentrations, although it is present at very high read counts
– The hypothesis is that the fusion is near the 5' end of the transcript and that the breakpoint position is affecting fusion calling (remains to be tested)
– Highlights the need for standard materials in this area
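One way to quantify the "linear response above a threshold" described above is a least-squares line in log-log space, fitted only to points above a chosen concentration cutoff. The sketch below is an assumption about how such a check could be done, not the analysis behind the slide; the function name `loglog_fit`, the cutoff, and the synthetic series are hypothetical, and the numbers are invented for illustration rather than taken from the experiment.

```python
import math


def loglog_fit(conc_nmoles, read_counts, min_log10_conc):
    """Fit log10(reads) = slope * log10(conc) + intercept by least squares.

    Only points with at least one supporting read and with log10(conc)
    at or above min_log10_conc are used.
    """
    pts = [
        (math.log10(c), math.log10(r))
        for c, r in zip(conc_nmoles, read_counts)
        if r > 0 and math.log10(c) >= min_log10_conc
    ]
    if len(pts) < 2:
        raise ValueError("need at least two detected points above the cutoff")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all points share one concentration")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pts) / sxx
    return slope, mean_y - slope * mean_x


# Synthetic series (NOT experimental data): no reads below 1e-10 nmoles,
# then read counts that grow tenfold per tenfold increase in concentration.
exponents = list(range(-14, -5))
conc = [10.0 ** e for e in exponents]
reads = [0 if e < -10 else 10 ** (e + 11) for e in exponents]
slope, intercept = loglog_fit(conc, reads, min_log10_conc=-10.5)
print(slope, intercept)
```

A slope close to 1 would indicate that supporting reads scale proportionally with spike concentration; the cutoff itself would have to be chosen from the observed detection limit of each fusion caller.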
Preliminary tests of the synthetic oligos appear promising
[Figure: supporting evidence strength (log10 read counts, axis 0 to 6) versus fusion spike RNA concentration (log10 nmoles, axis -14 to -6) for TopHat-Fusion, ChimeraScan, and SnowShoes]
Whole-Genome
– TGen – HiSeq 2500
– TGen – Complete Genomics
– ILMN – HiSeq 2500
Exomes (SNVs), N:T mixtures in %, Replicates 1, 2, 3 for each
– 0:100
– 50:50
– 75:25
– 90:10
– 95:5
– 99:1
– 100:0
WGS Large Insert (Structural Variants), N:T mixtures in %, Replicates 1, 2, 3 for each
– 0:100
– 50:50
– 75:25
– 90:10
– 95:5
– 99:1
– 100:0
RNA (Diff. Exp., Fusions), Replicates 1, 2, 3 for each
– Tumor
– Tumor ERCC 1
– Tumor ERCC 2
– Normal
– Norm ERCC 1
– Norm ERCC 2
– Fusion spikes
Arrays (Copy Number, Expression)
– Agilent
– Illumina
– Affymetrix
Analytical Validation at TGen
– 50+ flow cells
– 6 TB of sequencing data
– Equivalent to ~600 exomes (TCGA Phase 1)
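The normal:tumor mixture series in this design implies an expected allele fraction for each somatic variant, which is what makes it useful for estimating sensitivity. The sketch below works through that arithmetic under simplifying assumptions that are not stated on the slide: the N:T percentages are treated as fractions of genomes, the variant is absent from the normal line, and copy numbers default to a heterozygous variant in a diploid region. The function name `expected_vaf` and its parameters are hypothetical; copy numbers are exposed as arguments because tumor cell lines often deviate from diploid.

```python
def expected_vaf(normal_pct, tumor_pct, tumor_alt_copies=1,
                 tumor_copies=2, normal_copies=2):
    """Expected variant allele fraction of a somatic variant in an N:T mixture.

    Assumes the variant is absent from the normal genome and that the
    percentages describe the proportion of genomes from each cell line.
    """
    if normal_pct + tumor_pct != 100:
        raise ValueError("normal_pct and tumor_pct must sum to 100")
    t = tumor_pct / 100
    alt_alleles = t * tumor_alt_copies
    all_alleles = t * tumor_copies + (1 - t) * normal_copies
    return alt_alleles / all_alleles


# The seven N:T mixtures listed in the design above.
mixtures = [(0, 100), (50, 50), (75, 25), (90, 10), (95, 5), (99, 1), (100, 0)]
for n_pct, t_pct in mixtures:
    print(f"{n_pct}:{t_pct} -> {expected_vaf(n_pct, t_pct):.4f}")
```

Under these assumptions the 99:1 mixture gives an expected fraction of 0.005, about one supporting read per 200 reads of coverage, which shows why the low-tumor mixtures are the demanding end of the series.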
TGen and ILMN have begun a cross-site effort to generate a “gold standard” somatic dataset for a pair of cancer cell lines (COLO-829 & COLO-829BL) as well as a set of synthetic mRNA fusion transcripts.
Data generation is scheduled to be completed this month, analysis thereafter.
We intend to make the data publicly available.
Are these appropriate reference materials?
– Cell lines: stability, consent
– Fusion materials: preliminary data are encouraging; additional experiments are ongoing.
We welcome feedback and discussion.
Summary
Acknowledgements
Illumina
– Han-Yu Chuang
– Nancy Kim
– Timothy McDaniel
– Valerie Montel
– Jimmy Perrott
TGen
– Stephanie Buchholtz
– John Carpten
– David Craig
– Winnie Liang
– W. Amol Tembe
– Tracey White