Informatics for next-generation sequence analysis – SNP calling
Gabor T. Marth
Boston College Biology Department
PSB 2008
January 4-8, 2008
Read length and throughput
[Figure: bases per machine run (1 Mb to 1 Gb) versus read length (10 bp to 1,000 bp)]
• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)
• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)
• ABI capillary sequencer
Current and future application areas
• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
• De novo genome sequencing
• Short-read sequencing will be (at least) an alternative to micro-arrays for:
  • DNA-protein interaction analysis (ChIP-Seq)
  • novel transcript discovery
  • quantification of gene expression
  • epigenetic analysis (methylation profiling)
[Figure: reads aligned to a reference genome, with a SNP and a deletion (DEL) marked]
Fundamental informatics challenges (I)
1. Interpreting machine readouts – base calling, base error estimation
2. Dealing with non-uniqueness in the genome: resequenceability
3. Alignment of billions of reads
Informatics challenges (II)
4. SNP, short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management
Resequencing-based SNP discovery
genome reference sequence
Read mapping
Read alignment
Paralog identification
SNP detection + inspection
SNP calling workflow
• read alignment
• SNP detection
• visual checking
Bayesian detection algorithm
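$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$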
[Figure: the possible base combinations (A, C, G, T) across the aligned reads, divided into monomorphic combinations and polymorphic combinations]
Bayesian posterior probability, i.e. the SNP score, computed from:
• Base call + base quality
• Polymorphism rate (prior)
• Base composition
• Depth of coverage
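The Bayesian SNP score can be illustrated with a small brute-force sketch. This is not the released software. It enumerates every base combination across the reads covering one site and reports the share of posterior mass on the polymorphic combinations. Several pieces are assumptions made only for illustration: the function names, the error model (a miscalled base is equally likely to be any of the other three), and the joint prior (monomorphic combinations share a mass of 1 - theta, weighted by base composition; polymorphic combinations share theta uniformly).

```python
from itertools import product
from math import prod

BASES = "ACGT"


def base_posteriors(called_base, phred_quality):
    """P(S | R) for one read: assumed error model, errors spread evenly."""
    error = 10 ** (-phred_quality / 10)
    return {b: (1 - error) if b == called_base else error / 3 for b in BASES}


def snp_probability(reads, theta=0.001, composition=None):
    """Posterior probability that a site is polymorphic.

    reads: list of (called_base, phred_quality) pairs covering the site.
    theta: assumed prior polymorphism rate (hypothetical default).
    composition: per-base prior; uniform if not given (assumption).
    """
    if len(reads) < 2:
        raise ValueError("need at least two reads to observe a polymorphism")
    composition = composition or {b: 0.25 for b in BASES}
    posteriors = [base_posteriors(base, qual) for base, qual in reads]
    n_polymorphic = 4 ** len(reads) - 4  # all combinations minus AAAA.., CCCC.., etc.

    variable_sum = 0.0
    total_sum = 0.0
    for combo in product(BASES, repeat=len(reads)):
        monomorphic = len(set(combo)) == 1
        if monomorphic:
            joint_prior = (1 - theta) * composition[combo[0]]
        else:
            joint_prior = theta / n_polymorphic
        term = joint_prior * prod(
            posteriors[i][s] / composition[s] for i, s in enumerate(combo)
        )
        total_sum += term
        if not monomorphic:
            variable_sum += term
    return variable_sum / total_sum


if __name__ == "__main__":
    print(snp_probability([("A", 30)] * 4))                      # reads agree
    print(snp_probability([("A", 30), ("A", 30), ("C", 30), ("C", 30)]))
    print(snp_probability([("A", 30), ("A", 30), ("A", 30), ("C", 5)]))
```

The enumeration grows as 4^N, so this form is only practical for a handful of reads. It is meant to make the roles of base quality, prior, base composition, and depth visible rather than to be efficient.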
Base quality values for SNP calling
• base quality values help us decide if mismatches are true polymorphisms or sequencing errors
• accurate base qualities are crucial, especially in lower coverage
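To make the role of base quality concrete, the sketch below converts between a quality value and an error probability. It assumes the base qualities are on the standard Phred scale (Q = -10 log10 of the error probability), which the slides do not state explicitly. The function names are hypothetical.

```python
import math


def phred_to_error(quality):
    """Error probability implied by a Phred-scaled base quality."""
    return 10 ** (-quality / 10)


def error_to_phred(error_probability):
    """Phred-scaled quality for a given error probability."""
    return -10 * math.log10(error_probability)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: a miscall is expected about once in {round(1 / phred_to_error(q))} bases")
```

Under this scale a mismatch supported by a Q10 base is wrong about one time in ten, while a Q30 mismatch is wrong about one time in a thousand. That is why miscalibrated qualities matter most when only one or two reads cover a site.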
Priors for specific resequencing scenarios
[Figure: reads covering one candidate site in two resequencing scenarios]
Reads from three strains:
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
AACGTTCGCATA AACGTTCGCATA
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
Reads from three individuals:
AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
AACGTTAGCATA AACGTTAGCATA
Consensus sequence generation (genotyping)
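[Figure: reads covering one candidate site, with the call made for each sample]
Reads from three strains, with the consensus base called for each:
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA    A
AACGTTCGCATA AACGTTCGCATA                 C
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA    A
Reads from three individuals, with the genotype called for each:
AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA    A/C
AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA    C/C
AACGTTAGCATA AACGTTAGCATA                              A/A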
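The consensus and genotype calls on this slide can be reproduced with a simple maximum-likelihood sketch. This is an illustration under stated assumptions, not the method used in the released software: every base is given the same assumed quality (Q30), errors are spread evenly over the three other bases, all genotypes are treated as equally likely a priori, and a read is assumed to come from either allele of a diploid sample with equal probability. The function names and the `SITE` index are hypothetical helpers.

```python
from itertools import combinations_with_replacement
from math import prod

BASES = "ACGT"
SITE = 6  # 0-based position of the variable base in the slide's 12-base reads


def read_likelihood(observed, allele, error):
    """P(observed base | true allele) under an even-error model (assumption)."""
    return 1 - error if observed == allele else error / 3


def call_genotype(observed_bases, ploidy, phred_quality=30):
    """Most likely genotype: ploidy=1 gives a consensus base, ploidy=2 a diploid call."""
    error = 10 ** (-phred_quality / 10)

    def likelihood(genotype):
        # Each read is drawn from one of the alleles with equal probability.
        return prod(
            sum(read_likelihood(base, allele, error) for allele in genotype) / ploidy
            for base in observed_bases
        )

    best = max(combinations_with_replacement(BASES, ploidy), key=likelihood)
    return "/".join(best)


def site_column(reads, site=SITE):
    """Bases observed at one site across a sample's reads."""
    return [read[site] for read in reads]


if __name__ == "__main__":
    a_read, c_read = "AACGTTAGCATA", "AACGTTCGCATA"
    print(call_genotype(site_column([a_read] * 3), ploidy=1))
    print(call_genotype(site_column([c_read] * 2), ploidy=1))
    print(call_genotype(site_column([a_read] * 2 + [c_read] * 2), ploidy=2))
    print(call_genotype(site_column([c_read] * 4), ploidy=2))
    print(call_genotype(site_column([a_read] * 2), ploidy=2))
```

With only two reads, as in the last sample, the homozygous call rests on very little evidence. A scenario-specific prior would carry more weight there than this flat-prior sketch gives it.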
SNP calling in Roche/454 pyrosequences
SNP calling in low 454 coverage
• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American D. melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?
DNA courtesy of Chuck Langley, UC Davis
[Figure: alignment of a 46-2 454 read and 46-2 ABI reads (2 fwd + 2 rev) against the iso-1 reference]
• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)
SNP calling in Illumina/Solexa short-reads
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
• SNP calling error rate very low:
  Validation rate = 97.8% (224/229)
  Conversion rate = 92.6% (224/242)
  Missed SNP rate = 3.75% (26/693)
[Figure: read alignments showing a SNP and an insertion (INS)]
• INDEL candidates validate and convert at similar rates to SNPs:
  Validation rate = 89.3% (193/216)
  Conversion rate = 87.3% (193/221)
A C G G T C G T C G T G T G C G T
A C G G T C G T C G T G T G C G T
A C G G T C G C C G T G T G C G T
A C G G T C G T C G T G T G C G T
No change
SNP
Measurement error
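The three labels (no change, SNP, measurement error) match how two-base color-space encoding is usually described, although the slide text itself does not spell out the rule. The sketch below assumes the standard encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). Under that assumption a single base substitution changes two adjacent colors in a consistent way, while an isolated color change is more likely a measurement error. The function names and classification labels are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as colors, one per adjacent base pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def classify(ref_colors, read_colors):
    """Classify a read against the reference by its pattern of color mismatches."""
    mismatches = [
        i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q
    ]
    if not mismatches:
        return "no change"
    if len(mismatches) == 1:
        return "measurement error"
    if len(mismatches) == 2 and mismatches[1] - mismatches[0] == 1:
        i, j = mismatches
        # A single substituted base leaves the XOR of its two flanking colors unchanged.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "SNP candidate"
    return "unclassified"


if __name__ == "__main__":
    reference = "ACGGTCGTCGTGTGCGT"
    variant = "ACGGTCGCCGTGTGCGT"   # the differing read from the slide
    ref_colors = to_colors(reference)

    print(classify(ref_colors, to_colors(reference)))
    print(classify(ref_colors, to_colors(variant)))

    noisy = list(ref_colors)
    noisy[6] ^= 1                    # flip one color to mimic a miscalled color
    print(classify(ref_colors, noisy))
```

The example reuses the four sequences shown on the slide: three match the reference and one differs at a single base, which shows up as two adjacent color changes.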
SNP calling in AB/SOLiD color-space reads
Mutational profiling: deep 454/Illumina/SOLiD data
• collaboration with Doug Smith at Agencourt
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• determine where the mutations that caused this phenotype are located
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
• at about 15X nominal coverage, each technology can find every point mutation with essentially no false positives
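"Nominal coverage" here is taken to mean total sequenced bases divided by genome size. The sketch below shows that arithmetic for the 15 Mb genome at 15X. The read lengths (50 bp and 250 bp) are illustrative values taken from the upper ends of the ranges on the throughput slide, not the lengths actually used in this project, and the function names are hypothetical.

```python
import math


def nominal_coverage(n_reads, read_length, genome_size):
    """Total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(coverage, read_length, genome_size):
    """Smallest number of reads that reaches the requested nominal coverage."""
    return math.ceil(coverage * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 15_000_000  # 15 Mb genome, from the slide
    for read_length in (50, 250):  # illustrative read lengths (assumption)
        n = reads_needed(15, read_length, genome_size)
        print(f"{read_length} bp reads: {n:,} reads for 15X")
```

For these inputs the sketch gives 4,500,000 reads at 50 bp and 900,000 reads at 250 bp. Nominal coverage ignores reads that fail to align or map to repeats, so the usable depth at any given site is typically lower.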
Pichia stipitis reference sequence
Image from JGI web site
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release
Credits
http://bioinformatics.bc.edu/marthlab
Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.) BC Presidential Scholarship (A.R.Q.)
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby