GeneChip® CustomSeq® Resequencing Arrays
GeneChip® CustomSeq® Arrays
Most efficient and cost-effective method for large-scale sequence variation analysis
– up to 300kb on a single array for >.01 cent per base
Flexible array formats for a variety of applications: 50kb, 100kb, and 300kb
Long “read length” minimizes curation and assembly time
High-quality data
– Accuracy >99.9%
– Reproducibility >99.9%
Resequencing Assay Overview
[Figure: assay workflow] Genomic DNA → Select areas of interest by PCR → PCR products → Pool and fragment → DNA fragments → Label → End-labeled fragments → Hybridization, Wash, and Stain → Scan
GeneChip® Resequencing System
– GeneChip CustomSeq® Array
– GeneChip Resequencing Assay Kit
– GeneChip Scanner 3000
– GeneChip Operating Software (GCOS) 1.4: Data Collection, Data Management, Client-Server Option
– GeneChip Sequence Analysis Software (GSEQ) 4.0
Resequencing Tiling Strategy
[Figure, six animation frames: each base of the reference is interrogated by four 25-mer probes per strand that differ only at the central (13th) position; the probe window advances one base per frame.]
Reference, sense strand (35 bases shown): ATCGGTAGCCATGCATGAGTTACTACAGCTGTCGA
Reference, complementary strand (first 25 bases): TAGCCATCGGTACGTACTCAATGAT
Frame 1
Sense probes: ATCGGTAGCCATGCATGAGTTACTA, ATCGGTAGCCATCCATGAGTTACTA, ATCGGTAGCCATTCATGAGTTACTA, ATCGGTAGCCATACATGAGTTACTA
Complement probes: TAGCCATCGGTAGGTACTCAATGAT, TAGCCATCGGTACGTACTCAATGAT, TAGCCATCGGTATGTACTCAATGAT, TAGCCATCGGTAAGTACTCAATGAT
Frame 2
Sense probes: TCGGTAGCCATGTATGAGTTACTAC, TCGGTAGCCATGGATGAGTTACTAC, TCGGTAGCCATGCATGAGTTACTAC, TCGGTAGCCATGAATGAGTTACTAC
Complement probes: AGCCATCGGTACATACTCAATGATG, AGCCATCGGTACCTACTCAATGATG, AGCCATCGGTACGTACTCAATGATG, AGCCATCGGTACTTACTCAATGATG
Frame 3
Sense probes: CGGTAGCCATGCTTGAGTTACTACA, CGGTAGCCATGCGTGAGTTACTACA, CGGTAGCCATGCCTGAGTTACTACA, CGGTAGCCATGCATGAGTTACTACA
Complement probes: GCCATCGGTACGAACTCAATGATGT, GCCATCGGTACGCACTCAATGATGT, GCCATCGGTACGGACTCAATGATGT, GCCATCGGTACGTACTCAATGATGT
Frame 4
Sense probes: GGTAGCCATGCATGAGTTACTACAG, GGTAGCCATGCAGGAGTTACTACAG, GGTAGCCATGCACGAGTTACTACAG, GGTAGCCATGCAAGAGTTACTACAG
Complement probes: CCATCGGTACGTACTCAATGATGTC, CCATCGGTACGTCCTCAATGATGTC, CCATCGGTACGTGCTCAATGATGTC, CCATCGGTACGTTCTCAATGATGTC
Frame 5
Sense probes: GTAGCCATGCATTAGTTACTACAGC, GTAGCCATGCATGAGTTACTACAGC, GTAGCCATGCATCAGTTACTACAGC, GTAGCCATGCATAAGTTACTACAGC
Complement probes: CATCGGTACGTAATCAATGATGTCG, CATCGGTACGTACTCAATGATGTCG, CATCGGTACGTAGTCAATGATGTCG, CATCGGTACGTATTCAATGATGTCG
Frame 6
Sense probes: TAGCCATGCATGTGTTACTACAGCT, TAGCCATGCATGGGTTACTACAGCT, TAGCCATGCATGCGTTACTACAGCT, TAGCCATGCATGAGTTACTACAGCT
Complement probes: ATCGGTACGTACACAATGATGTCGA, ATCGGTACGTACCCAATGATGTCGA, ATCGGTACGTACGCAATGATGTCGA, ATCGGTACGTACTCAATGATGTCGA
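The tiling scheme illustrated in these frames can be sketched in a few lines of Python. This is a minimal illustration, not the vendor's array design software: the function name `tile_probes` is hypothetical, the 35-base reference is assumed to be the 25-base fragment joined to the 10-base continuation shown on the slides, and probes are written in the same orientation the slides display them.

```python
# Minimal sketch of resequencing tiling (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[b] for b in seq)


def tile_probes(reference, probe_len=25):
    """Yield (center_index, strand, probe) for every full-length window.

    For each window and each strand, four probes are produced that differ
    only at the central position (A, C, G, or T).
    """
    half = probe_len // 2
    strands = {"sense": reference, "complement": complement(reference)}
    for start in range(len(reference) - probe_len + 1):
        center = start + half  # 0-based index of the interrogated base
        for strand, seq in strands.items():
            window = seq[start:start + probe_len]
            for base in BASES:
                yield center, strand, window[:half] + base + window[half + 1:]


# Assumed reference: the slide's 25-mer plus its 10-base continuation.
reference = "ATCGGTAGCCATGCATGAGTTACTACAGCTGTCGA"
probes = list(tile_probes(reference))
```

Each interrogated base yields eight probes (four per strand), so a 35-base reference with 25-mer windows gives 11 interrogated positions and 88 probes in this sketch.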
Resequencing Array Performance
Call rates* > 90%
Overall Accuracy > 99.9%
Reproducibility > 99.9%
* Average call rate reported for arrays where all target has been amplified to sufficient quantity
Performance may vary depending on genomic content on individual custom designs. Specific factors that impact performance include:
– GC content
– INDELs
– Divergence from reference sequence
– Multiple SNPs within 10bp
Performance characterized across several data sets
– Data Set 1: Mouse – 30kb of sequence in 9 inbred DBA (diploid genome with homozygous SNPs)
– Data Set 2: ENCODE – 300kb across 16 CEPH (diploid genome with heterozygous SNPs)
– Data Set 3: Mitochondrial – 16kb across 3 replicates of 1 reference (haploid genome with heteroplasmy)
Quality Threshold Score: Impact on call rate versus accuracy
[Figure: two panels, Homozygous Model and Heterozygous Model. Each plots Overall Accuracy (axis 99.75% to 100.00%) and Call Rate (axis 65% to 100%) against QTS (0 to 12).]
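The trade-off these plots describe can be illustrated with a small threshold sweep. This is a sketch under stated assumptions: it assumes each base call carries a numeric quality score and that calls scoring below the threshold are reported as N (no call). The function name `sweep_threshold` and the toy data are hypothetical and are not GSEQ output.

```python
def sweep_threshold(calls, thresholds):
    """Return (threshold, call_rate, accuracy) for each threshold.

    calls: list of (quality_score, is_correct) pairs, one per base.
    Calls scoring below the threshold are treated as no-calls (N).
    """
    results = []
    for t in thresholds:
        kept = [ok for score, ok in calls if score >= t]
        call_rate = len(kept) / len(calls)
        # Accuracy is computed over called bases only.
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        results.append((t, call_rate, accuracy))
    return results


# Toy data (invented for illustration): low-scoring calls are more error-prone.
toy_calls = [(14.0, True), (11.5, True), (8.0, True),
             (5.0, False), (2.0, True), (1.0, False)]
results = sweep_threshold(toy_calls, [0, 3, 6, 9, 12])
```

In the toy data, raising the threshold from 0 to 6 drops the call rate from 100% to 50% while accuracy over the remaining calls rises to 100%, which is the same qualitative behaviour the figure reports.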
Data Set 1: Mouse Array – Homozygous Model
Metric | Definition | Value
Average Call Rate | The number of bases called divided by the total number of bases possible. | 95.92%
Overall Accuracy | For all bases where a call is made, the percentage that agrees with capillary sequencing. | 99.99%
Overall Reproducibility | For a pair of technical replicate chips (pairs of mouse samples in this case), concordance is computed for all sites where the two arrays make a call. N’s excluded. | >99.99%
Homozygous SNP Call Rate | Percentage of calls made for all known SNP positions. | 95.95%
Homozygous SNP Accuracy | Percentage of calls made as a SNP when capillary sequencing called the same base a SNP. Does not include N’s or SNPs within 9 bp of another SNP. | 100%
Homozygous SNP False Positive | Percentage of calls made as a SNP when capillary sequencing called the base a reference. | 0.01%
Homozygous SNP False Negative | Percentage of N calls made when capillary sequencing called the base a SNP. Does not include SNPs within 9 bp of another SNP. Calculated for individual genotypes, not SNP sites. | 4.05%
Homozygous SNP Reproducibility | Same as overall reproducibility, but for SNP sites only. | 100%
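The metric definitions above translate directly into code. The following is a minimal sketch for a haploid or homozygous case, assuming array calls, capillary calls, and the reference are equal-length strings with `N` marking a no-call. The function name `resequencing_metrics` and the toy sequences are hypothetical, and the exclusion of SNPs within 9 bp of another SNP is not implemented.

```python
def resequencing_metrics(array_calls, capillary_calls, reference):
    """Compute call rate, accuracy, SNP false-negative and false-positive rates."""
    assert len(array_calls) == len(capillary_calls) == len(reference)
    total = len(reference)

    # Call rate: bases called divided by the total number of bases possible.
    called = [(a, c) for a, c in zip(array_calls, capillary_calls) if a != "N"]
    call_rate = len(called) / total

    # Accuracy: of the called bases, the fraction agreeing with capillary data.
    accuracy = sum(a == c for a, c in called) / len(called)

    # SNP false negatives: N calls at positions capillary sequencing calls a SNP.
    snp_sites = [i for i in range(total) if capillary_calls[i] != reference[i]]
    snp_fn = sum(array_calls[i] == "N" for i in snp_sites) / len(snp_sites)

    # SNP false positives: SNP calls where capillary sequencing calls reference.
    ref_sites = [i for i in range(total)
                 if capillary_calls[i] == reference[i] and array_calls[i] != "N"]
    snp_fp = sum(array_calls[i] != reference[i] for i in ref_sites) / len(ref_sites)

    return {"call_rate": call_rate, "accuracy": accuracy,
            "snp_false_negative": snp_fn, "snp_false_positive": snp_fp}


# Toy sequences (invented): SNPs at indices 4 and 9, no-calls at 6 and 9.
metrics = resequencing_metrics(
    array_calls="ACGTTCNTAN",
    capillary_calls="ACGTTCGTAA",
    reference="ACGTACGTAC",
)
```

The sketch divides without guarding against empty denominators, so it expects at least one called base, one SNP site, and one called reference site.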
Data Set 2: Diploid Analysis of ENCODE Interval on Chr 4
• Collected array data and dideoxy sequence data from 16 diploid CEPH individuals across 115kbp of non-repetitive Chr4 sequence (16 × 115kbp = 1.84Mbp in total)
• Dideoxy sequencing data
– Total of 1.44Mbp covered by dideoxy sequencing
– Each variant was confirmed by genotyping in all 16 DNAs
• Array data
– 27 LR-PCRs amplified ~250kbp of genomic sequence
– LR-PCRs pooled, fragmented, labeled, and hybridized per SOP
– One array hybridized per individual to query 115kb of non-repetitive sequence
– Intensity data analysed using GSEQ v3.0 in diploid mode, at various quality threshold values
Data Set 2: Diploid ENCODE Region
Metric | Definition | Value
Call Rate | The number of non-N calls divided by the total number of calls | 96.56%
Overall Accuracy | Percentage of all calls (excluding Ns) that are concordant with ENCODE data | 99.95%
Call Rate at Variant Sites | Percentage of calls made for all known SNP loci, including heterozygous and homozygous calls | 89.70%
SNP False Negative Rate | Percentage of variant positions in the ENCODE data that are called N or reference in the array data | 17.34%
SNP False Positive Rate | Percentage of reference positions in the ENCODE data that are called variant in the array data | 0.04%
Homozygous Accuracy | Percentage of homozygous variant positions in the ENCODE data with concordant array data (excluding array Ns) | 96.91%
Heterozygous Accuracy | Percentage of heterozygous positions in the ENCODE data with concordant array data (excluding array Ns) | 86.25%
Homozygous SNP False Negative | Percentage of mis-calls (No Calls and Ref calls) made for all known homozygous SNP positions in the ENCODE data | 9.12%
Heterozygous SNP False Negative | Percentage of mis-calls (No Calls and Ref calls) made for all known heterozygous SNP positions in the ENCODE data | 22.15%
Post GSEQ Filters to Reduce False Positives
Summary of exclusions (position*sample specific cell counts)
Filter | Calls removed | # FPs removed | % FPs removed | # TPs removed | % TPs removed
PCR Failure | 19519 | 168 | 31.28% | 5 | 0.33%
Nearby SNPs – Footprint | 252 | 167 | 31.10% | 14 | 0.33%
Cross Hybridization sites | 64 | 33 | 6.15% | 0 | 0.00%
Low Complexity Probes | 128 | 1 | 0.19% | 0 | 0.00%
Non-biallelic Calls | 32 | 32 | 5.96% | 0 | 0.00%
Performance Post GSEQ Filters
Metric | Before Filters | After Filters
Call Rate | 96.56% | 95.98%
Overall Accuracy | 99.95% | 99.98%
False Positive Calls | 537 | 219
False Positive Rate | 0.040% | 0.016%
True Positive Calls | 1498 | 1479
SNP call False Negative Rate | 17.34% | 18.52%
SNP site False Negative Rate | 8.18% | 9.39%
Impact of GC Content on Call Rates
[Figure: Call Rate vs. Probe GC content. Call Rate (axis 75% to 100%) plotted against % probe GC in bins <10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, >70.]
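A call-rate-by-GC breakdown like the one plotted here can be computed from per-probe data. This is a minimal sketch: the names `gc_percent` and `call_rate_by_gc` and the toy probes are hypothetical, and the bins are half-open 10-point intervals labelled by their lower edge, which differs slightly from the bin labels on the slide.

```python
def gc_percent(probe):
    """Percentage of G and C bases in a probe sequence."""
    probe = probe.upper()
    return 100.0 * sum(base in "GC" for base in probe) / len(probe)


def call_rate_by_gc(probes, bin_width=10):
    """Group probes into GC bins and return the call rate per bin.

    probes: iterable of (sequence, was_called) pairs.
    Returns {bin_lower_edge: fraction_called}, sorted by bin.
    """
    counts = {}
    for seq, was_called in probes:
        lower = int(gc_percent(seq) // bin_width) * bin_width
        called, total = counts.get(lower, (0, 0))
        counts[lower] = (called + int(was_called), total + 1)
    return {lower: called / total
            for lower, (called, total) in sorted(counts.items())}


# Toy probes (invented for illustration).
toy_probes = [
    ("ATATATATAT", False),  # 0% GC
    ("ATATATATGC", True),   # 20% GC
    ("GCGCATATAT", True),   # 40% GC
    ("GCGCATATTT", False),  # 40% GC
    ("GCGCGCATAT", True),   # 60% GC
]
by_bin = call_rate_by_gc(toy_probes)
```

For a 25-mer from the tiling slides, `gc_percent("ATCGGTAGCCATGCATGAGTTACTA")` counts 11 G or C bases, or 44% GC.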
Batch Analysis Improves Performance
[Figure: Performance as a function of sample size. Call Rate and FN rate (axis 0% to 100%) and FP rate (axis 0.000% to 0.025%) plotted against Number of Samples (1, 2, 4, 8, 16).]
CustomSeq® Applications
Haploid
– Pathogen identification and typing
– Mitochondria
Diploid
– Candidate genes
– Regions of linkage/association
– Pharmacogenomics
Microarray-based Resequencing of Multiple Bacillus anthracis Isolates
Zwick, ME. et al., Genome Biology, 6:R10 (2004)
Bacillus anthracis Research
Rapid, accurate, and inexpensive resequencing is required for a variety of applications and studies.
– Definitively identify B. anthracis in environmental and clinical samples
– Determine forensic attribution and detect genetic manipulation
– Determine phylogenetic relationships of strains
– Uncover the genetic basis of phenotypic variation in traits such as mammalian virulence
Neither the AFLP nor the MLST studies discover and genotype sufficient genetic variation to distinguish between B. anthracis strains.
Sequencing efforts are increasing but limited by cost.
Bacillus anthracis Custom Array
Experimental Design
Array
– 30kb CustomSeq® array containing 29,212bp of unique sequence
Samples
– 56 isolates from Biological Defense Research Directorate's strain collection
– Samples hybridized in replicate on 2 arrays
Assay
– Long range PCR
Analysis
– ABACUS software
Bacillus anthracis Custom Array - Results
Results – Call Rate and Reproducibility
– 115/118 array hybridizations successful
– Average call rate = 92.6%
– High reproducibility: only one discrepancy found between replicates across 1.35Mb of sequence
Replication experiment
Total number of bases called in replicate 1 | 1,383,229
Total number of bases called in replicate 2 | 1,373,905
Total number of bases called in both replicates | 1,349,177
Total number of bases called differently | 1
Replication experiment discrepancy rate | 7.4E-07
Bacillus anthracis Custom Array - Results
Results – Accuracy
– 30 arrays hybridized to anthrax strains previously sequenced by capillary sequencing
– 15 discrepancies at 6 SNP sites
– 10 discrepancies at 5 sites resolved, as arrays agreed with most recent shotgun assembly
– 1 site accounting for 5 discrepancies could not be confirmed (based on a single read with phred score of 7)
Accuracy estimation experiment
Total number of bases called identically | 398,452
Total number of bases called differently | 15
Accuracy experiment discrepancy rate | 3.8E-05
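The two discrepancy rates reported for the B. anthracis study follow from the base counts on the slides. The short check below is a sketch that assumes the replication rate is one discrepancy over the bases called in both replicates, and the accuracy rate is the differing calls over all compared bases (identical plus differing). The variable names are illustrative.

```python
# Replication experiment: 1 base called differently among the
# 1,349,177 bases called in both replicates.
replication_rate = 1 / 1_349_177

# Accuracy experiment: 15 differing calls; 398,452 identical calls.
accuracy_rate = 15 / (398_452 + 15)

print(f"{replication_rate:.1e}")  # 7.4e-07
print(f"{accuracy_rate:.1e}")     # 3.8e-05
```

Both values round to the figures quoted on the slides (7.4E-07 and 3.8E-05).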
Bacillus anthracis Custom Array - Conclusions
Study demonstrated that microarray-based resequencing is technologically robust and generates highly replicable and accurate data when compared to alternative sequence technologies
In this experiment, 115 arrays, or 97.5% of the total attempted, were processed successfully, obtaining an average high-quality base-calling rate of 92.6%
Called bases are shown to be highly replicable (discrepancy rate of 7.4 × 10^-7) and accurate when compared to conventional shotgun sequence (discrepancy rate of < 2.5 × 10^-6)
Sequencing Arrays for Screening Multiple Genes Associated with Early-Onset Human Retinal Degenerations on a High-Throughput Platform
Mandal MN, et al. Invest Ophthalmol Vis Sci. 46(9):3355-62 (2005)
* Study conducted at the Dept of Ophthalmology and Visual Science at the University of Michigan, Ann Arbor in collaboration with McGill University Health Science Center and the NEI/NIH
Retinitis pigmentosa
Progressive retinal degeneration leading to irreversible blindness or severe visual impairment
Affects 1:3500 individuals worldwide
Broad genetic heterogeneity with at least 32 genes known to be associated with various forms (AD, AR, X-linked) of RP
Several treatments are in development, but response to individual treatments is likely to be linked to genotype
Screening all known genes (~60kb) is inefficient by traditional methods
Retinitis pigmentosa Custom Array
Array
– 11 RP genes (coding and flanking regions) representing 25.8kb unique sequence were tiled on a 30kb CustomSeq Array
Samples
– 35 cases with known genotypes
– 35 novel cases
– 26 unaffected family members
Assay
– Standard CustomSeq protocol
– Traditional PCR: 159 amplicons
Retinitis pigmentosa Array - Results
Base calling performance
– Average Call Rates = 97.60% (individual arrays ranged 96.0%-98.5%)
– Accuracy = >99%
– Reproducibility = >99%
SNP Detection
– 506 sequence changes identified
– Accurately detected 382 previously reported SNPs and identified 113 novel SNPs
– Accurately detected 5 previously reported mutations and identified 7 novel rare mutations
Retinitis pigmentosa Array - Results
Summary of Novel Potentially Pathogenic Nucleotide Changes and Previously Reported Mutations Detected in Patient DNA with the arRP-I Arrays
Patient | Gene | Nucleotide Change | Amino Acid Change | Genotype | Reference
KE727 | RHO | C959A | Thr320Asn | Het | Novel
KE1246 | CRB1 | G2473A | Glu825Lys | Het | Novel
R165 | TULP1 | IVS2 _ 3 A _ G | | Homo | Novel
R206 | ABCA4 | G1699A | Val567Met | Het | Novel
KE869 | RGR | C734T | Ser245Phe | Het | Novel
R353 | MERTK | G500A | Arg167His | Het | Novel
R376 | ABCA4 | IVS23-2 A _ T | | Het | Novel
KE385 | RPE65 | T963G | Asn321Lys | Het | Known
KE1246 | ABCA4 | T3602G | Leu1201Arg | Het | Known
KE1246 | ABCA4 | G5077A | Val1693Ile | Het | Known
R376 | ABCA4 | C5327T | Pro1776Leu | Het | Known
Retinitis pigmentosa Array - Conclusion
Resequencing arrays provide an efficient and reliable method of high-throughput screening for mutations in genetically heterogeneous diseases.
They allow multiple genes to be screened at once and support the analysis of both Mendelian and complex forms of retinal degeneration.
Comparison of material costs revealed that arrays were 23% cheaper. Time and labor savings further increased the cost effectiveness of this method.
A Transforming MET Mutation Discovered in Non-small Cell Lung Cancer Using Microarray-based Resequencing
Tengs T., et al. Cancer Lett. (2005)
* Study conducted at the Dana Farber Cancer Institute in collaboration with MIT/Broad Institute and Merck Pharmaceuticals
Custom Cancer Array
Objective
– Evaluate the performance of resequencing arrays to detect mutations in oncogenes and tumor suppressor genes
– Identify novel mutations which may have an impact on therapeutic response
Experimental design
– 164 exons (23,966 bp) from genes associated with cancer
– Sequenced 20 lung tumor samples with matched normal controls
– Dideoxy sequencing was performed on a subset of exons in order to evaluate the performance of the arrays
Custom Cancer Array Performance
Coverage and accuracy of resequencing arrays when compared to dideoxy sequencing
– Call rate: 97.53%
– Overall accuracy: 99.99%
– Only 4 SNP call errors reported: 3 hom SNPs called het SNP, 1 het SNP called ref
Exons also covered by dideoxy sequencing
Total number of bases interrogated | 335,420
Total number of ’no calls’ made by GDAS | 8,283 (2.47%)
Coverage | 327,137 (97.53%)
Number of homozygous mutations found by dideoxy sequencing in loci where GDAS made calls | 37
Number of heterozygous mutations found by dideoxy sequencing in loci where GDAS made calls | 71
Total number of ’no calls’ made by GDAS in mutated loci | 11 (9.24%)
Homozygous mutations called correctly by GDAS | 34 (91.89%)
Heterozygous mutations called correctly by GDAS | 70 (98.59%)
Total number of correct calls by GDAS in loci covered | 327,132 (99.99%)
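The percentages in this performance table can be reproduced from the raw counts. The check below is a sketch with illustrative variable names. It assumes the 9.24% no-call figure uses all mutated loci as its denominator (the 37 + 71 called mutations plus the 11 no-calls), which the slide does not state explicitly.

```python
total_bases = 335_420
no_calls = 8_283
called = total_bases - no_calls      # bases with a GDAS call
coverage = called / total_bases      # call rate

correct_calls = 327_132
concordance = correct_calls / called

hom_found, hom_correct = 37, 34
het_found, het_correct = 71, 70
no_calls_in_mutated = 11
# Assumed denominator: called mutations plus no-calls at mutated loci.
mutated_loci = hom_found + het_found + no_calls_in_mutated

print(f"coverage: {coverage:.2%}")
print(f"homozygous correct: {hom_correct / hom_found:.2%}")
print(f"heterozygous correct: {het_correct / het_found:.2%}")
print(f"no-calls at mutated loci: {no_calls_in_mutated / mutated_loci:.2%}")
```

The four miscalled mutations (three homozygous, one heterozygous) match the SNP call errors listed on the slide.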
Amino acid changing mutations detected in the 20 NSCLC samples
Gene | Refseq | Nucleotide change | Amino acid change | Origin | Heterozygous/homozygous
CDKN2A | NM_000077 | G654A | A148T | Germline | 2/0b
CDKN2A | NM_000077 | T556A | V115E | Germline | 1/0
KRAS2 | NM_004985 | G216A | G12D | Somatic | 2/1
KRAS2 | NM_004985 | G216T | G12V | Somatic | 1/0
MET | NM_000245 | A1311G | N375S | Germline | 3/0
MET | NM_000245 | C2646T | P814S | Germline | 1/0
MET | NM_000245 | C3162T | T1010I | Germline | 1/0
NRAS | NM_002524 | A435T | Q61L | Germline | 1/0
PTEN | NM_000314 | G1266A | A79T | Germline | 1/0
RET | NM_020630 | G1645A | D489N | Germline | 1/0
RET | NM_020630 | G2251A | G691S | Germline | 8/0
RET | NM_020630 | C3124T | R982C | Germline | 1/0
TP53 | NM_000546 | G1075T | C275F | Somatic | 1/0
TP53 | NM_000546 | G984T | G245C | Somatic | 1/0
TP53 | NM_000546 | G714T | R158L | Somatic | 2/0
TP53 | NM_000546 | G775A | R175H | Somatic | 1/0
TP53 | NM_000546 | G997T | R249M | Somatic | 1/0
TP53 | NM_000546 | C1167T | R306Stop | Somatic | 1/0
TP53 | NM_000546 | G466C | R72P | Germline | 4/1
TP53 | NM_000546 | G→T | splice-site | Somatic | 1/0
TP53 | NM_000546 | G1065T | V272L | Somatic | 1/0
TP53 | NM_000546 | A739G | Y163C | Somatic | 1/0
Custom Cancer Array - Conclusions
Results show that resequencing microarrays can be used as a tool for cancer mutation detection and discovery
The overall performance of the platform is comparable to traditional Sanger-based sequencing with a very high concordance rate (327,132 out of 327,137 bases called consistently; >99.99% concordance)
Furthermore, we have found the transforming MET mutation T1010I in NSCLC to be present in a small fraction of lung tumors. Since MET inhibitors are currently being evaluated in lung cancer, it is tempting to speculate that they might prove beneficial in a subset of lung tumors with activated MET tyrosine kinase
Summary
CustomSeq® Resequencing arrays have proven to be a valuable tool for a variety of applications including microbial research, mitochondrial analysis, and resequencing of genes involved in heterogeneous diseases
CustomSeq arrays provide an efficient and cost-effective method for large-scale sequence variation analysis
Resequencing arrays provide high-quality sequence information
– Call Rates >90%
– Accuracy >99.9%
– Reproducibility >99.9%
Resequencing arrays facilitate large-scale comparative sequencing projects by providing significant benefits in terms of ease of use and data analysis