GeneChip® CustomSeq® Resequencing Arrays

GeneChip® CustomSeq® Arrays

Most efficient and cost-effective method for large- scale sequence variation analysis

– Up to 300kb on a single array for >.01 cent per base

Flexible array formats for a variety of applications; 50kb, 100kb, and 300kb

Long “read length” minimizes curation and assembly time

High-quality data
– Accuracy >99.9%
– Reproducibility >99.9%

Resequencing Assay Overview

[Figure: assay workflow. Genomic DNA; select areas of interest by PCR; PCR products; pool and fragment; DNA fragments; label; end-labeled fragments; hybridization, wash, and stain; scan.]

GeneChip® Resequencing System

GeneChip CustomSeq® Array

GeneChip Resequencing Assay Kit

GeneChip Scanner 3000

GeneChip Operating Software (GCOS) 1.4
– Data Collection
– Data Management
– Client-Server Option

GeneChip Sequence Analysis Software (GSEQ) 4.0

Resequencing Tiling Strategy

[Figure: probe tiling. Each base of the target is interrogated by four 25-mer probes per strand that differ only at the central (13th) position. Probe set for the first interrogated base:]

Sense strand probes
ATCGGTAGCCATACATGAGTTACTA
ATCGGTAGCCATCCATGAGTTACTA
ATCGGTAGCCATGCATGAGTTACTA
ATCGGTAGCCATTCATGAGTTACTA

Complementary strand probes
TAGCCATCGGTATGTACTCAATGAT
TAGCCATCGGTAGGTACTCAATGAT
TAGCCATCGGTACGTACTCAATGAT
TAGCCATCGGTAAGTACTCAATGAT

The probe window then advances one base at a time along the target; six consecutive positions are shown in the original figure.
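The tiling scheme in the figure above can be sketched in a few lines of Python. This is only an illustration, not the array design software: the function name `tile_probes` and its `probe_len` parameter are hypothetical, and the 30-base target is the one spelled out by the six probe sets in the figure.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tile_probes(reference, probe_len=25):
    """Yield (center, sense_probes, complement_probes) for each covered base.

    For every interrogated base, four probes are built per strand. They are
    identical except at the central position, which takes each of A, C, G, T.
    """
    half = probe_len // 2
    for center in range(half, len(reference) - half):
        window = reference[center - half:center + half + 1]
        sense = [window[:half] + b + window[half + 1:] for b in BASES]
        # Base-wise complement, written left to right as in the figure.
        complement = [p.translate(COMPLEMENT) for p in sense]
        yield center, sense, complement


# Target reconstructed from the probes shown in the figure (assumption).
target = "ATCGGTAGCCATGCATGAGTTACTACAGCT"
for center, sense, complement in tile_probes(target):
    print(center, sense[BASES.index(target[center])])
```

A 30-base target and 25-mer probes give six interrogated positions, which matches the six probe sets in the figure.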


Resequencing Array Performance

Call rates* > 90%

Overall Accuracy > 99.9%

Reproducibility > 99.9%

* Average call rate reported for arrays where all targets have been amplified to sufficient quantity

Performance may vary depending on genomic content on individual custom designs. Specific factors that impact performance include:
– GC content
– INDELs
– Divergence from reference sequence
– Multiple SNPs within 10bp

Performance characterized across several data sets
– Data Set 1: Mouse – 30kb of sequence in 9 inbred DBA (diploid genome with homozygous SNPs)
– Data Set 2: ENCODE – 300kb across 16 CEPH (diploid genome with heterozygous SNPs)
– Data Set 3: Mitochondrial – 16kb across 3 replicates of 1 reference (haploid genome with heteroplasmy)

Quality Threshold Score
Impact on call rate versus accuracy

[Figure: two panels, Homozygous Model and Heterozygous Model. Each panel plots Overall Accuracy (axis 99.75% to 100.00%) and Call Rate (axis 65% to 100%) against QTS (0 to 12).]
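The trade-off plotted above can be illustrated with a small threshold sweep. This is a sketch under stated assumptions: it supposes that each base call carries a numeric quality score and a known correct/incorrect label, and the function name `sweep_threshold` and the toy data are hypothetical. It does not reproduce how GSEQ computes its score.

```python
def sweep_threshold(calls, thresholds):
    """Return (threshold, call_rate, accuracy) for each threshold.

    calls: list of (quality_score, is_correct) pairs, one per interrogated base.
    A base is called only when its score meets the threshold; otherwise it
    becomes a no-call (N) and is excluded from the accuracy calculation.
    """
    total = len(calls)
    results = []
    for t in thresholds:
        kept = [ok for score, ok in calls if score >= t]
        call_rate = len(kept) / total if total else 0.0
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        results.append((t, call_rate, accuracy))
    return results


# Toy data: the one wrong call also has the lowest score.
toy_calls = [(1, False), (4, True), (7, True), (10, True)]
for t, call_rate, accuracy in sweep_threshold(toy_calls, [0, 3, 6]):
    print(t, call_rate, accuracy)
```

Raising the threshold discards low-scoring calls, so call rate falls while accuracy among the remaining calls rises.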

Data Set 1: Mouse Array – Homozygous Model

Average Call Rate: 95.92%
– The number of bases called divided by the total number of bases possible.

Overall Accuracy: 99.99%
– For all bases where a call is made, the percentage that agrees with capillary sequencing.

Overall Reproducibility: >99.99%
– For a pair of technical replicate chips (pairs of mouse samples in this case), concordance is computed for all sites where the two arrays make a call. N's excluded.

Homozygous SNP Call Rate: 95.95%
– Percentage of calls made for all known SNP positions.

Homozygous SNP Accuracy: 100%
– Percentage of calls made as a SNP when capillary sequencing called the same base a SNP.
– Does not include N's or SNPs within 9 bp of another SNP.

Homozygous SNP False Positive: 0.01%
– Percentage of calls made as a SNP when capillary sequencing called the base a reference.

Homozygous SNP False Negative: 4.05%
– Percentage of N calls made when capillary sequencing called the base a SNP.
– Does not include SNPs within 9 bp of another SNP.
– Calculated for individual genotypes, not SNP sites.

Homozygous SNP Reproducibility: 100%
– Same as overall reproducibility, but for SNP sites only.
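The metric definitions above can be made concrete with a short sketch for the homozygous case. It assumes one single-letter call per position with `N` as a no-call. The function name `resequencing_metrics` and the toy sequences are hypothetical, and the false positive denominator (called reference positions) is an assumption, since the slide does not state it.

```python
def resequencing_metrics(array_calls, capillary_calls, reference):
    """Compare array calls with capillary sequencing, position by position."""
    rows = list(zip(array_calls, capillary_calls, reference))
    called = [(a, c) for a, c, _ in rows if a != "N"]
    snp_sites = [a for a, c, r in rows if c != r]             # capillary says SNP
    ref_called = [(a, r) for a, c, r in rows if c == r and a != "N"]
    return {
        "call_rate": len(called) / len(rows),
        "accuracy": sum(a == c for a, c in called) / len(called),
        "snp_call_rate": sum(a != "N" for a in snp_sites) / len(snp_sites),
        "snp_false_negative": sum(a == "N" for a in snp_sites) / len(snp_sites),
        "snp_false_positive": sum(a != r for a, r in ref_called) / len(ref_called),
    }


# Toy example: SNPs at the 5th and 10th positions, two no-calls on the array.
reference = "ACGTACGTAC"
capillary = "ACGTTCGTAA"
array     = "ACNTTCGTAN"
print(resequencing_metrics(array, capillary, reference))
```

With these definitions the SNP call rate and the SNP false negative rate sum to 100%, as the reported 95.95% and 4.05% do.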

Data Set 2: Diploid Analysis of ENCODE Interval on Chr 4

• Collected array data and dideoxy sequence data from 16 diploid CEPH individuals across 115kbp of non-repetitive Chr4 sequence = 1.84Mbp in total

• Dideoxy sequencing data
  • Total of 1.44Mbp covered by dideoxy sequencing
  • Each variant was confirmed by genotyping in all 16 DNAs

• Array data
  • 27 LR-PCRs amplified ~250kbp of genomic sequence
  • LR-PCRs pooled, fragmented, labeled, and hybridized per SOP
  • Hybridized one array per individual to query 115kb of non-repetitive sequence
  • Intensity data analysed using GSEQ v3.0
    • in diploid mode
    • at various quality threshold values

Data Set 2: Diploid ENCODE Region

Call Rate: 96.56%
– The number of non-N calls divided by the total number of calls.

Overall Accuracy: 99.95%
– Percentage of all calls (excluding Ns) that are concordant with ENCODE data.

Call Rate at Variant Sites: 89.70%
– Percentage of calls made for all known SNP loci, including heterozygous and homozygous calls.

SNP False Negative Rate: 17.34%
– Percentage of variant positions in the ENCODE data that are called N or reference in the array data.

SNP False Positive Rate: 0.04%
– Percentage of reference positions in the ENCODE data that are called variant in the array data.

Homozygous Accuracy: 96.91%
– Percentage of homozygous variant positions in the ENCODE data with concordant array data (excluding array Ns).

Heterozygous Accuracy: 86.25%
– Percentage of heterozygous positions in the ENCODE data with concordant array data (excluding array Ns).

Homozygous SNP False Negative: 9.12%
– Percentage of mis-calls (No Calls and Ref calls) made for all known homozygous SNP positions in the ENCODE data.

Heterozygous SNP False Negative: 22.15%
– Percentage of mis-calls (No Calls and Ref calls) made for all known heterozygous SNP positions in the ENCODE data.

Post GSEQ Filters to Reduce False Positives

Summary of exclusions (position*sample specific cell counts)

Exclusion                    Calls removed   # FPs removed   % FPs removed   # TPs removed   % TPs removed
PCR Failure                  19519           168             31.28%          5               0.33%
Nearby SNPs – Footprint      252             167             31.10%          14              0.33%
Cross Hybridization sites    64              33              6.15%           0               0.00%
Low Complexity Probes        128             1               0.19%           0               0.00%
Non-biallelic Calls          32              32              5.96%           0               0.00%

Performance Post GSEQ Filters

                               Before Filters   After Filters
Call Rate                      96.56%           95.98%
Overall Accuracy               99.95%           99.98%
False Positive Calls           537              219
False Positive Rate            0.040%           0.016%
True Positive Calls            1498             1479
SNP call False Negative Rate   17.34%           18.52%
SNP site False Negative rate   8.18%            9.39%
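Three of the exclusions listed above (PCR failure, nearby SNPs, non-biallelic calls) lend themselves to a short sketch. This is an illustration under assumptions, not the filter set used in the study: the data layout, the function name `filter_variant_calls`, and the 9-base footprint default are all hypothetical, and the cross-hybridization and low-complexity filters are omitted because they need probe annotation not shown here.

```python
from collections import defaultdict


def filter_variant_calls(variant_calls, failed_regions=None, footprint=9):
    """Drop variant calls that fall under any of three exclusion rules.

    variant_calls: dict mapping (sample, position) -> non-reference allele.
    failed_regions: dict mapping sample -> list of (start, end) amplicons
                    that failed PCR in that sample (inclusive coordinates).
    footprint: a call is dropped if the same sample has another variant
               call within this many bases (assumed value).
    """
    failed_regions = failed_regions or {}
    alleles_at = defaultdict(set)      # position -> alleles seen in any sample
    positions_in = defaultdict(list)   # sample -> positions with a variant call
    for (sample, pos), allele in variant_calls.items():
        alleles_at[pos].add(allele)
        positions_in[sample].append(pos)

    kept = {}
    for (sample, pos), allele in variant_calls.items():
        in_failed = any(s <= pos <= e for s, e in failed_regions.get(sample, ()))
        near_other = any(p != pos and abs(p - pos) <= footprint
                         for p in positions_in[sample])
        non_biallelic = len(alleles_at[pos]) > 1
        if not (in_failed or near_other or non_biallelic):
            kept[(sample, pos)] = allele
    return kept


calls = {("s1", 100): "A", ("s1", 105): "T", ("s1", 300): "G",
         ("s2", 300): "C", ("s2", 500): "T", ("s1", 900): "C"}
print(filter_variant_calls(calls, failed_regions={"s2": [(450, 550)]}))
```

As the before/after table shows, filters of this kind trade a small loss in call rate and true positives for a larger drop in false positives.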

Impact of GC Content on Call Rates

[Figure: Call Rate vs. Probe GC content. x-axis: % probe GC, binned as <10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, >70; y-axis: Call Rate, 75.00% to 100.00%.]
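For readers who want to place their own probes on the GC axis of this chart, a minimal sketch follows. The helper names `gc_percent` and `gc_bin` are hypothetical, and assigning exactly 10% to the "<10" bin is an assumption because the chart labels leave that edge undefined.

```python
import math


def gc_percent(probe):
    """Percentage of G and C bases in a probe sequence."""
    probe = probe.upper()
    return 100.0 * sum(base in "GC" for base in probe) / len(probe)


def gc_bin(percent):
    """Map a GC percentage to the bin labels used on the chart's x-axis."""
    if percent <= 10:
        return "<10"
    if percent > 70:
        return ">70"
    upper = math.ceil(percent / 10) * 10
    return f"{upper - 9}-{upper}"


probe = "ATCGGTAGCCATGCATGAGTTACTA"  # 25-mer from the tiling figure
print(gc_percent(probe), gc_bin(gc_percent(probe)))
```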

Batch Analysis Improves Performance

[Figure: Performance as a function of sample size. x-axis: Number of Samples (1, 2, 4, 8, 16); first y-axis: Call Rate / False Negative Rate, 0.0% to 100.0%; second y-axis: False Positive Rate, 0.000% to 0.025%. Series: Call Rate, FN rate, FP rate.]

CustomSeq® Applications

Haploid
– Pathogen identification and typing
– Mitochondria

Diploid
– Candidate genes
– Regions of linkage/association
– Pharmacogenomics

Microarray-based Resequencing of Multiple Bacillus anthracis Isolates

Zwick, ME. et al., Genome Biology, 6:R10 (2004)

Bacillus anthracis Research

Rapid, accurate, and inexpensive resequencing required for a variety of applications and studies.
– Definitively identify B. anthracis in environmental and clinical samples
– Determine forensic attribution and detect genetic manipulation
– Determine phylogenetic relationships of strains
– Uncover the genetic basis of phenotypic variation in traits such as mammalian virulence

Neither the AFLP nor the MLST studies discover and genotype sufficient genetic variation to distinguish between B. anthracis strains.

Sequencing efforts are increasing but limited by cost.


Bacillus anthracis Custom Array

Experimental Design

Array
– 30kb CustomSeq® array containing 29,212bp of unique sequence

Samples
– 56 isolates from Biological Defense Research Directorate's strain collection
– Samples hybridized in replicate on 2 arrays

Assay
– Long range PCR

Analysis
– ABACUS software


Bacillus anthracis Custom Array - Results

Results – Call Rate and Reproducibility
– 115/118 array hybridizations successful
– Average call rate = 92.6%
– High reproducibility: only one discrepancy found between replicates across 1.35Mb of sequence

Replication experiment

Total number of bases called in replicate 1         1,383,229
Total number of bases called in replicate 2         1,373,905
Total number of bases called in both replicates     1,349,177
Total number of bases called differently            1
Replication experiment discrepancy rate             7.4E-07


Bacillus anthracis Custom Array - Results

Results – Accuracy
– 30 arrays hybridized to anthrax strains previously sequenced by capillary sequencing
– 15 discrepancies / 6 SNP sites
  – 10 discrepancies / 5 sites resolved as arrays agreed with most recent shotgun assembly
  – 1 site accounting for 5 discrepancies could not be confirmed based on a single read with phred score of 7

Accuracy estimation experiment

Total number of bases called identically     398,452
Total number of bases called differently     15
Accuracy experiment discrepancy rate         3.8E-05
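The two discrepancy rates reported for the replication and accuracy experiments follow directly from the counts on the slides. The sketch below recomputes them; the helper name `discrepancy_rate` is hypothetical, and taking the accuracy denominator as identical plus different calls is an assumption (dividing by the identical calls alone rounds to the same value).

```python
def discrepancy_rate(n_different, n_compared):
    """Fraction of compared bases that were called differently."""
    return n_different / n_compared


# Replication experiment: bases called in both replicates.
replicate_rate = discrepancy_rate(1, 1_349_177)

# Accuracy experiment: array calls versus capillary sequence.
accuracy_rate = discrepancy_rate(15, 398_452 + 15)

print(f"{replicate_rate:.1e}")  # 7.4e-07
print(f"{accuracy_rate:.1e}")   # 3.8e-05
```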


Bacillus anthracis Custom Array - Conclusions

Study demonstrated that microarray-based resequencing is technologically robust and generates highly replicable and accurate data when compared to alternative sequence technologies.

In this experiment, 115 arrays, or 97.5% of the total attempted, were processed successfully, obtaining an average high-quality base-calling rate of 92.6%.

Called bases are shown to be highly replicable (discrepancy rate of 7.4 × 10⁻⁷) and accurate when compared to conventional shotgun sequence (discrepancy rate of < 2.5 × 10⁻⁶).


Sequencing Arrays for Screening Multiple Genes Associated with Early-Onset Human Retinal Degenerations on a High-Throughput Platform

Mandal MN, et al. Invest Ophthalmol Vis Sci. 46(9):3355-62 (2005)

* Study conducted at the Dept of Ophthalmology and Visual Science at the University of Michigan, Ann Arbor in collaboration with McGill University Health Science Center and the NEI/NIH

Retinitis pigmentosa

Progressive retinal degeneration leading to irreversible blindness or severe visual impairment

Affects 1:3500 individuals worldwide

Broad genetic heterogeneity with at least 32 genes known to be associated with various forms (AD, AR, X-linked) of RP.

Several treatments are in development but response to individual treatments is likely to be linked to genotype.

Screening all known genes (~60kb) is inefficient by traditional methods.

Retinitis pigmentosa Custom Array

Array
– 11 RP genes (coding and flanking regions) representing 25.8kb unique sequence were tiled on a 30kb CustomSeq Array

Samples
– 35 cases with known genotypes
– 35 novel cases
– 26 unaffected family members

Assay
– Standard CustomSeq protocol
– Traditional PCR: 159 amplicons


Retinitis pigmentosa Array - Results

Base calling Performance
– Average Call Rates = 97.60% (individual arrays ranged 96.0%-98.5%)
– Accuracy = >99%
– Reproducibility = >99%

SNP Detection
– 506 sequence changes identified
  – Accurately detected 382 previously reported SNPs and identified 113 novel SNPs
  – Accurately detected 5 previously reported mutations and identified 7 novel rare mutations


Retinitis pigmentosa Array - Results

Summary of Novel Potentially Pathogenic Nucleotide Changes and Previously Reported Mutations Detected in Patient DNA with the arRP-I Arrays

Patient   Gene    Nucleotide Change   Amino Acid Change   Genotype   Reference
KE727     RHO     C959A               Thr320Asn           Het        Novel
KE1246    CRB1    G2473A              Glu825Lys           Het        Novel
R165      TULP1   IVS2 _ 3 A _ G                          Homo       Novel
R206      ABCA4   G1699A              Val567Met           Het        Novel
KE869     RGR     C734T               Ser245Phe           Het        Novel
R353      MERTK   G500A               Arg167His           Het        Novel
R376      ABCA4   IVS23-2 A _ T                           Het        Novel
KE385     RPE65   T963G               Asn321Lys           Het        Known
KE1246    ABCA4   T3602G              Leu1201Arg          Het        Known
KE1246    ABCA4   G5077A              Val1693Ile          Het        Known
R376      ABCA4   C5327T              Pro1776Leu          Het        Known


Retinitis pigmentosa Array - Conclusion

Resequencing arrays provide an efficient and reliable method of high-throughput screening for mutations in genetically heterogeneous diseases.

Enables screening of multiple genes and analysis of both Mendelian and complex forms of retinal degeneration.

Comparison of material costs revealed that arrays were 23% cheaper. Time and labor savings further increased the cost effectiveness of this method.


A Transforming MET Mutation Discovered in Non-small Cell Lung Cancer Using Microarray-based Resequencing

Tengs T., et al. Cancer Lett. (2005)

* Study conducted at the Dana Farber Cancer Institute in collaboration with MIT/Broad Institute and Merck Pharmaceuticals

Custom Cancer Array

Objective
– Evaluate the performance of resequencing arrays to detect mutations in oncogenes and tumor suppressor genes
– Identify novel mutations which may have an impact on therapeutic response

Experimental design
– 164 exons (23,966 bp) from genes associated with cancer
– Sequenced 20 lung tumor samples with matched normal controls
– Dideoxy sequencing was performed on a subset of exons in order to evaluate the performance of the arrays


Custom Cancer Array Performance

Coverage and accuracy of resequencing arrays when compared to dideoxy sequencing
– Call rate: 97.53%
– Overall accuracy: 99.99%
– Only 4 SNP call errors reported: 3 hom SNP called het SNP, 1 het SNP called ref

Exons also covered by dideoxy sequencing

Total number of bases interrogated                                                            335,420
Total number of 'no calls' made by GDAS                                                       8,283 (2.47%)
Coverage                                                                                      327,137 (97.53%)
Number of homozygous mutations found by dideoxy sequencing in loci where GDAS made calls      37
Number of heterozygous mutations found by dideoxy sequencing in loci where GDAS made calls    71
Total number of 'no calls' made by GDAS in mutated loci                                       11 (9.24%)
Homozygous mutations called correctly by GDAS                                                 34 (91.89%)
Heterozygous mutations called correctly by GDAS                                               70 (98.59%)
Total number of correct calls by GDAS in loci covered                                         327,132 (99.99%)
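The percentages in this table can be recomputed from the raw counts, which is a useful sanity check when reading it. The sketch below uses only numbers from the slide; the variable names and the `pct` helper are hypothetical, and the denominator for the mutated-loci no-call rate (37 + 71 + 11 = 119) is inferred from the reported 9.24%.

```python
def pct(part, whole):
    """Percentage rounded to two decimals, as reported on the slide."""
    return round(100 * part / whole, 2)


interrogated = 335_420
no_calls = 8_283
called = interrogated - no_calls          # 327,137 bases with a call
hom_found, het_found = 37, 71             # dideoxy mutations at called loci
no_calls_in_mutated = 11
hom_correct, het_correct = 34, 70
correct_calls = 327_132

print(pct(no_calls, interrogated))        # 2.47
print(pct(called, interrogated))          # 97.53
mutated_loci = hom_found + het_found + no_calls_in_mutated
print(pct(no_calls_in_mutated, mutated_loci))   # 9.24
print(pct(hom_correct, hom_found))        # 91.89
print(pct(het_correct, het_found))        # 98.59
print(100 * correct_calls / called)       # just above 99.99
```

The last ratio is slightly above 99.99%, consistent with the ">99.99% concordance" quoted in the conclusions.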


Amino acid changing mutations detected in the 20 NSCLC samples

Gene     Refseq      Nucleotide change   Amino acid change   Origin     Heterozygous/homozygous
CDKN2A   NM_000077   G654A               A148T               Germline   2/0b
CDKN2A   NM_000077   T556A               V115E               Germline   1/0
KRAS2    NM_004985   G216A               G12D                Somatic    2/1
KRAS2    NM_004985   G216T               G12V                Somatic    1/0
MET      NM_000245   A1311G              N375S               Germline   3/0
MET      NM_000245   C2646T              P814S               Germline   1/0
MET      NM_000245   C3162T              T1010I              Germline   1/0
NRAS     NM_002524   A435T               Q61L                Germline   1/0
PTEN     NM_000314   G1266A              A79T                Germline   1/0
RET      NM_020630   G1645A              D489N               Germline   1/0
RET      NM_020630   G2251A              G691S               Germline   8/0
RET      NM_020630   C3124T              R982C               Germline   1/0
TP53     NM_000546   G1075T              C275F               Somatic    1/0
TP53     NM_000546   G984T               G245C               Somatic    1/0
TP53     NM_000546   G714T               R158L               Somatic    2/0
TP53     NM_000546   G775A               R175H               Somatic    1/0
TP53     NM_000546   G997T               R249M               Somatic    1/0
TP53     NM_000546   C1167T              R306Stop            Somatic    1/0
TP53     NM_000546   G466C               R72P                Germline   4/1
TP53     NM_000546   GOT                 splice-site         Somatic    1/0
TP53     NM_000546   G1065T              V272L               Somatic    1/0
TP53     NM_000546   A739G               Y163C               Somatic    1/0


Custom Cancer Array - Conclusions

Results show that resequencing microarrays can be used as a tool for cancer mutation detection and discovery

The overall performance of the platform is comparable to traditional Sanger-based sequencing with a very high concordance rate (327,132 out of 327,137 bases called consistently; >99.99% concordance)

Furthermore, we have found the transforming MET mutation T1010I in NSCLC to be present in a small fraction of lung tumors. Since MET inhibitors are currently being evaluated in lung cancer, it is tempting to speculate that they might prove beneficial in a subset of lung tumors with activated MET tyrosine kinase


Summary

CustomSeq® Resequencing arrays have proven to be a valuable tool for a variety of applications including microbial research, mitochondrial analysis, and resequencing of genes involved in heterogeneous diseases

CustomSeq arrays provide an efficient and cost-effective method for large-scale sequence variation analysis

Resequencing arrays provide high-quality sequence information

– Call Rates >90%
– Accuracy >99.9%
– Reproducibility >99.9%

Resequencing Arrays facilitate large-scale comparative sequencing projects by providing significant benefits in terms of ease of use and data analysis
