Informatics for next-generation sequence analysis – SNP calling. Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008


Informatics for next-generation sequence analysis – SNP calling

Gabor T. Marth
Boston College Biology Department

PSB 2008
January 4-8, 2008


Read length and throughput

[Figure: bases per machine run (1 Mb to 1 Gb) versus read length (10 bp to 1,000 bp)]

• ABI capillary sequencer
• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)
• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)


Current and future application areas

• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery

• De novo genome sequencing

• Short-read sequencing will be (at least) an alternative to micro-arrays for:

    • DNA-protein interaction analysis (ChIP-Seq)
    • novel transcript discovery
    • quantification of gene expression
    • epigenetic analysis (methylation profiling)

[Figure: reference genome with a SNP and a deletion (DEL) marked]


Fundamental informatics challenges (I)

1. Interpreting machine readouts – base calling, base error estimation

2. Dealing with non-uniqueness in the genome: resequenceability

3. Alignment of billions of reads


Informatics challenges (II)

4. SNP, short INDEL, and structural variation discovery

5. Data visualization

6. Data storage & management


Resequencing-based SNP discovery

genome reference sequence

Read mapping

Read alignment

Paralog identification

SNP detection + inspection


SNP calling workflow

• read alignment

• SNP detection

• visual checking


Bayesian detection algorithm

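$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
  \dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
  \displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
  \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$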

[Figure: columns of aligned bases (A, C, G, T) illustrating a polymorphic combination and a monomorphic combination]

Inputs:
• Base call + base quality
• Polymorphism rate (prior)
• Base composition
• Depth of coverage

Output: Bayesian posterior probability, i.e. the SNP score
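The posterior calculation can be sketched in a few lines of Python. This is a minimal illustration, not the released software. It assumes Phred-scaled base qualities, sequencing errors spread evenly over the three other bases, a uniform base-composition prior, and a joint prior that splits a hypothetical polymorphism rate `poly_rate` evenly over all polymorphic base combinations. The names `snp_probability`, `base_posterior`, and `poly_rate` are illustrative. The sketch enumerates all 4^N combinations, so it is only practical for a handful of reads.

```python
from itertools import product

BASES = "ACGT"


def base_posterior(true_base, called_base, qual):
    """P(S_i | R_i): probability of the true base given one base call.

    Assumes a Phred-scaled quality and errors spread evenly over the
    three alternative bases (both are simplifying assumptions).
    """
    error = 10 ** (-qual / 10)
    return 1 - error if true_base == called_base else error / 3


def snp_probability(calls, quals, poly_rate=0.001):
    """Posterior probability that the site is polymorphic (the SNP score)."""
    if len(calls) != len(quals) or len(calls) < 2:
        raise ValueError("need at least two base calls, one quality each")
    n = len(calls)
    base_prior = 1 / len(BASES)        # uniform base composition (assumed)
    n_mono = len(BASES)                # AAAA..., CCCC..., GGGG..., TTTT...
    n_poly = len(BASES) ** n - n_mono  # every other combination
    variable_sum = 0.0                 # numerator: polymorphic combinations
    total_sum = 0.0                    # denominator: all combinations
    for combo in product(BASES, repeat=n):
        is_mono = len(set(combo)) == 1
        # Joint prior: (1 - poly_rate) shared by monomorphic combinations,
        # poly_rate shared by polymorphic ones (assumed split).
        joint_prior = (1 - poly_rate) / n_mono if is_mono else poly_rate / n_poly
        term = joint_prior
        for true_base, called, qual in zip(combo, calls, quals):
            term *= base_posterior(true_base, called, qual) / base_prior
        total_sum += term
        if not is_mono:
            variable_sum += term
    return variable_sum / total_sum


if __name__ == "__main__":
    print(snp_probability("AAAA", [30, 30, 30, 30]))  # all reads agree
    print(snp_probability("AACC", [30, 30, 30, 30]))  # two alleles, high quality
    print(snp_probability("AACC", [5, 5, 5, 5]))      # two alleles, low quality
```

With these assumptions, high-quality reads that disagree push the score toward 1, while the same disagreement among low-quality reads is explained mostly by sequencing error.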


Base quality values for SNP calling

• base quality values help us decide whether mismatches are true polymorphisms or sequencing errors
• accurate base qualities are crucial, especially at lower coverage

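$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
  \dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
  \displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
  \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$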
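To illustrate why base quality matters most at low coverage, the sketch below converts quality values to error probabilities and asks how likely it is that every read showing a mismatch is simply wrong. It assumes the common Phred scaling, Q = -10 log10(error probability), which the slide does not name explicitly, and it assumes errors in different reads are independent. The function names are illustrative.

```python
import math


def phred_to_error_prob(qual):
    """Error probability for a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-qual / 10)


def error_prob_to_phred(error_prob):
    """Inverse conversion: Phred-scaled quality for an error probability."""
    return -10 * math.log10(error_prob)


def prob_all_calls_wrong(quals):
    """Probability that every listed base call is an error.

    Assumes errors in different reads are independent.
    """
    prob = 1.0
    for qual in quals:
        prob *= phred_to_error_prob(qual)
    return prob


if __name__ == "__main__":
    # One mismatching read versus two mismatching reads at the same quality.
    print(prob_all_calls_wrong([20]))
    print(prob_all_calls_wrong([20, 20]))
```

With a single mismatching read, the quality value is the only evidence separating a true polymorphism from an error, so a miscalibrated quality translates directly into a miscalibrated SNP score.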


Priors for specific resequencing scenarios

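$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
  \dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
  \displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
  \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$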

[Figure: example reads for two resequencing scenarios]

Strains:
• strain 1: AACGTTAGCATA ×3
• strain 2: AACGTTCGCATA ×2
• strain 3: AACGTTAGCATA ×3

Individuals:
• individual 1: AACGTTAGCATA ×2, AACGTTCGCATA ×2
• individual 2: AACGTTCGCATA ×4
• individual 3: AACGTTAGCATA ×2


Consensus sequence generation (genotyping)

[Figure: example reads with the consensus base or genotype called at the variable site]

Strains:
• strain 1: AACGTTAGCATA ×3, consensus A
• strain 2: AACGTTCGCATA ×2, consensus C
• strain 3: AACGTTAGCATA ×3, consensus A

Individuals:
• individual 1: AACGTTAGCATA ×2, AACGTTCGCATA ×2, genotype A/C
• individual 2: AACGTTCGCATA ×4, genotype C/C
• individual 3: AACGTTAGCATA ×2, genotype A/A
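The consensus and genotype calls on this slide can be reproduced with a simple maximum-likelihood sketch. This is an illustration under stated assumptions, not necessarily how the released software genotypes: it assumes a single fixed per-base error rate (`error`, hypothetical), equal sampling of both alleles in a diploid individual, and a flat prior over genotypes. `ploidy=1` stands in for the strain scenario and `ploidy=2` for the individual scenario.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def read_likelihood(observed, allele, error):
    """P(observed base | true allele) with errors spread over three bases."""
    return 1 - error if observed == allele else error / 3


def call_genotype(observed_bases, ploidy=2, error=0.01):
    """Return the most likely genotype at one site, e.g. 'A/C' or 'A'.

    observed_bases: the base each read shows at the site.
    ploidy: 1 for a single-allele consensus, 2 for a diploid genotype.
    """
    best_genotype, best_likelihood = None, -1.0
    for genotype in combinations_with_replacement(BASES, ploidy):
        likelihood = 1.0
        for base in observed_bases:
            # Each read is drawn from one of the alleles with equal chance.
            likelihood *= sum(
                read_likelihood(base, allele, error) for allele in genotype
            ) / ploidy
        if likelihood > best_likelihood:
            best_genotype, best_likelihood = genotype, likelihood
    return "/".join(best_genotype)


if __name__ == "__main__":
    # Bases at the variable site, taken from the reads shown on the slide.
    print(call_genotype("AAA", ploidy=1))
    print(call_genotype("CC", ploidy=1))
    print(call_genotype("AACC", ploidy=2))
    print(call_genotype("CCCC", ploidy=2))
    print(call_genotype("AA", ploidy=2))
```

A flat genotype prior keeps the sketch short. A scenario-specific prior would multiply each genotype's likelihood before the comparison.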


SNP calling in Roche/454 pyrosequences


SNP calling in low 454 coverage

• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American Drosophila melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?

DNA courtesy of Chuck Langley, UC Davis

[Figure: alignment of the iso-1 reference, a 46-2 454 read, and 46-2 ABI reads (2 fwd + 2 rev)]

• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)


SNP calling in Illumina/Solexa short-reads


SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

• SNP calling error rate very low:
    Validation rate = 97.8% (224/229)
    Conversion rate = 92.6% (224/242)
    Missed SNP rate = 3.75% (26/693)

[Figure: read alignments showing a SNP and an insertion (INS)]

• INDEL candidates validate and convert at similar rates to SNPs:
    Validation rate = 89.3% (193/216)
    Conversion rate = 87.3% (193/221)
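The percentages on this slide are simple ratios of the counts in parentheses. The short sketch below recomputes them. The function name `percent` and the dictionary labels are illustrative, and the slide does not spell out how each denominator was defined, so the code only divides the counts as shown.

```python
def percent(hits, total):
    """Return hits / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


# Counts copied from the slide: (numerator, denominator).
counts = {
    "SNP validation rate": (224, 229),
    "SNP conversion rate": (224, 242),
    "missed SNP rate": (26, 693),
    "INDEL validation rate": (193, 216),
    "INDEL conversion rate": (193, 221),
}

if __name__ == "__main__":
    for name, (hits, total) in counts.items():
        print(f"{name}: {percent(hits, total):.2f}% ({hits}/{total})")
```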


SNP calling in AB/SOLiD color-space reads

[Figure: aligned sequences illustrating three cases: no change, SNP, and measurement error]

A C G G T C G T C G T G T G C G T

A C G G T C G T C G T G T G C G T

A C G G T C G C C G T G T G C G T

A C G G T C G T C G T G T G C G T
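The three cases in the figure follow from how two-base (color-space) encoding works: each color encodes a pair of adjacent bases, so an isolated base substitution changes two adjacent colors, while a single miscalled color changes only one. The sketch below illustrates this with the two sequences from the figure. It uses the standard two-base encoding (color = XOR of 2-bit base codes). The function names and the simple classification rule are illustrative assumptions, not the released software's method, and the rule ignores edge cases such as a substitution in the first or last base.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as colors, one per adjacent base pair."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def classify(reference, read_colors):
    """Classify a color-space read against a base-space reference.

    Simplified rule for interior positions only:
    no color differences            -> "no change"
    one isolated color difference   -> "measurement error"
    two adjacent color differences
    whose combined XOR is unchanged -> "SNP"
    anything else                   -> "other"
    """
    ref_colors = to_colors(reference)
    diffs = [
        i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c
    ]
    if not diffs:
        return "no change"
    if len(diffs) == 1:
        return "measurement error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        # A substitution leaves the XOR of the two flanking colors unchanged.
        if ref_colors[i] ^ ref_colors[i + 1] == read_colors[i] ^ read_colors[i + 1]:
            return "SNP"
    return "other"


if __name__ == "__main__":
    reference = "ACGGTCGTCGTGTGCGT"   # sequence shown in the figure
    variant = "ACGGTCGCCGTGTGCGT"     # same sequence with one T -> C change
    print(classify(reference, to_colors(reference)))
    print(classify(reference, to_colors(variant)))
    miscalled = to_colors(reference)
    miscalled[3] ^= 1                 # corrupt a single color
    print(classify(reference, miscalled))
```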


Mutational profiling: deep 454/Illumina/SOLiD data

• collaboration with Doug Smith at Agencourt

• Pichia stipitis converts xylose to ethanol (bio-fuel production)

• one mutagenized strain had especially high conversion efficiency

• determine the location of the mutations that caused this phenotype

• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads

• 14 true point mutations in the entire genome

• at about 15X nominal coverage, each technology can find every point mutation with essentially no false positives

Pichia stipitis reference sequence

Image from JGI web site
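As a rough guide to what 15X nominal coverage of a 15 Mb genome means in reads, the sketch below uses the usual definition of nominal coverage (total sequenced bases divided by genome size). The read lengths of 35 bp and 250 bp are hypothetical values chosen from within the ranges quoted earlier in the talk, not figures from this experiment.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to reach a nominal coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def nominal_coverage(n_reads, read_length_bp, genome_size_bp):
    """Nominal coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    genome_size = 15_000_000  # 15 Mb genome, from the slide
    target = 15               # 15X nominal coverage, from the slide
    # Read lengths below are assumptions for illustration only.
    for label, length in [("short reads (35 bp)", 35), ("longer reads (250 bp)", 250)]:
        print(label, reads_needed(genome_size, target, length))
```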


Our software is available for testing

http://bioinformatics.bc.edu/marthlab/Beta_Release


Credits

http://bioinformatics.bc.edu/marthlab

Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)

Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)

Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby