66
Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009

Next-generation sequencing: informatics & software aspects

  • Upload
    vaughan

  • View
    55

  • Download
    0

Embed Size (px)

DESCRIPTION

Next-generation sequencing: informatics & software aspects. Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009. New sequencing technologies…. … offer vast throughput … & many applications. 100 Gb. Illumina/Solexa , AB/ SOLiD sequencers. - PowerPoint PPT Presentation

Citation preview

Page 1: Next-generation sequencing: informatics & software aspects

Next-generation sequencing:informatics & software aspects

Gabor T. MarthBoston College Biology Department

Harvard NanocourseOctober 7, 2009

Page 2: Next-generation sequencing: informatics & software aspects

New sequencing technologies…

Page 3: Next-generation sequencing: informatics & software aspects

… offer vast throughput … & many applications

read length

base

s per

mac

hine

run

10 bp 1,000 bp100 bp

1 Gb

100 Mb

10 Mb

10 Gb

Illumina/Solexa, AB/SOLiD sequencers

ABI capillary sequencer

Roche/454 pyrosequencer(100Mb-1Gb in 200-450bp reads)

(10-50Gb in 25-100 bp reads)

1 Mb

100 Gb

Page 4: Next-generation sequencing: informatics & software aspects

Genome resequencing for variation discovery

SNPs

short INDELs

structural variations

• the most immediate application area

Page 5: Next-generation sequencing: informatics & software aspects

Genome resequencing for mutational profiling

Organismal reference sequence

• likely to change “classical genetics” and mutational analysis

Page 6: Next-generation sequencing: informatics & software aspects

De novo genome sequencing

Lander et al. Nature 2001

• difficult problem with short reads

• promising, especially as reads get longer

Page 7: Next-generation sequencing: informatics & software aspects

Identification of protein-bound DNA

Chromatin structure (CHIP-SEQ)(Mikkelsen et al. Nature 2007)

Transcription binding sites. (Robertson et al. Nature Methods, 2007)

DNA methylation. (Meissner et al. Nature 2008)

• natural applications for next-gen. sequencers

Page 8: Next-generation sequencing: informatics & software aspects

Transcriptome sequencing: transcript discovery

Mortazavi et al. Nature Methods 2008

Ruby et al. Cell, 2006

• high-throughput, but short reads pose challenges

Page 9: Next-generation sequencing: informatics & software aspects

Transcriptome sequencing: expression profiling

Jones-Rhoads et al. PLoS Genetics, 2007

Cloonan et al. Nature Methods, 2008

• high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

Page 10: Next-generation sequencing: informatics & software aspects

… & enable personal genome sequencing

Page 11: Next-generation sequencing: informatics & software aspects

The re-sequencing informatics pipelineREF

(ii) read mappingIND

(i) base calling

IND(iii) SNP and short INDEL calling

(v) data viewing, hypothesis generation

(iv) SV calling GigaBayesGigaBayes

Page 12: Next-generation sequencing: informatics & software aspects

The variation discovery “toolbox”

• base callers• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayesGigaBayes

Page 13: Next-generation sequencing: informatics & software aspects

Base error characteristics vary

Inser-tions

1.43%

Dele-tions

3.23%

Substitutions95.34%

Illumina

454

Page 14: Next-generation sequencing: informatics & software aspects

Read lengths vary

read length [bp]0 100 200 300

~200-450 (variable)

25-100(fixed)

25-50 (fixed)

25-60 (variable)

400

Page 15: Next-generation sequencing: informatics & software aspects

Sequence traces are machine-specific

Base calling is increasingly left to machine manufacturers

Page 16: Next-generation sequencing: informatics & software aspects

Representational biases

Page 17: Next-generation sequencing: informatics & software aspects

Fragment duplication

Page 18: Next-generation sequencing: informatics & software aspects

Read mapping is like a jigsaw

… and they give you the picture on the box

2. Read mapping…you get the pieces…

Unique pieces are easier to place than others…

Page 19: Next-generation sequencing: informatics & software aspects

Multiply-mapping reads

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• “Traditional” repeat masking does not capture repeats at the scale of the read length

Page 20: Next-generation sequencing: informatics & software aspects

Dealing with multiple mapping

Page 21: Next-generation sequencing: informatics & software aspects

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

PE reads are now the standard for whole-genome short-read sequencing

Page 22: Next-generation sequencing: informatics & software aspects

Gapped alignments (for INDELs)

Page 23: Next-generation sequencing: informatics & software aspects

The MOSAIK read mapper

• gapped mapper• option to report multiple map locations• aligns 454, Illumina, SOLiD, Helicos reads• works with standard file formats (SRF, FASTQ, SAM/BAM)

Michael Strömberg

Page 24: Next-generation sequencing: informatics & software aspects

Alignment post-processing

0 5 10 15 20 25 30 35 40 45 500

5

10

15

20

25

30

35

40

45

50

55

60

Act

ual b

ase

qual

ity

Called base quality

• quality value re-calibration

• duplicate fragment removal

Page 25: Next-generation sequencing: informatics & software aspects

Data storage requirements

Page 26: Next-generation sequencing: informatics & software aspects

Alignment visualization

• too much data – indexed browsing• too much detail – color coding, show/hide

Page 27: Next-generation sequencing: informatics & software aspects

SNP calling: old problem, new data

sequencing error polymorphism

1 2

1 21

1

1 2

Pr | Pr | Pr , , ,

Pr | Pr | Pr , , ,

Pr , , , |i

kT

ii n

l kT

nk ki i i n

i

nk k l l l li i

iG

n

B T T G G G G

B T T G G G G

G G G B

Page 28: Next-generation sequencing: informatics & software aspects

Allele calling in next-gen data

SNP

INS

New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

Page 29: Next-generation sequencing: informatics & software aspects

SNP calling in multi-sample read sets

P(G1=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gi=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gn=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(SNP)

“genotype probabilities”

P(B1=aacc|G1=aa)P(B1=aacc|G1=cc)P(B1=aacc|G1=ac)

P(Bi=aaaac|Gi=aa)P(Bi=aaaac|Gi=cc)P(Bi=aaaac|Gi=ac)

P(Bn=cccc|Gn=aa)P(Bn=cccc|Gn=cc)P(Bn=cccc|Gn=ac)

“genotype likelihoods”

Prio

r(G1,.

.,Gi,..

, Gn)

-----a----------a----------c----------c-----

-----a----------a----------a----------a----------c-----

-----c----------c----------c----------c-----

Page 30: Next-generation sequencing: informatics & software aspects

Trio sequencing

2

2

2 22 2

2

2

2

2 2

2

11 12 22

1 111: 1 12 2 11: 111: 1

1 111 12 : 2 1 12 : 2 1 1 12 : 12 2

22 : 22 : 11 122 : 12 2

1 1 111: 1 1 11:2 2 4

Pr | , 1 112 12 : 2 1 12 2

1 122 : 12 2

M M M

F

C M FF

G G G

G

G G GG

2 2 2

2 22 2

2 22

2

2 22 2

1 1 1 11 1 11: 12 4 2 2

1 1 1 1 112 : 2 1 1 2 1 12 : 1 2 14 2 4 2 2

1 1 1 1 122 : 1 1 22 : 1 14 2 4 2 2

1 111: 12 211: 1

1 122 12 : 1 12 : 12

22 : 1FG

2

22

11:2 1 12 : 2 1

222 : 11 122 : 1 1

2 2

• the child inherits one chromosome from each parent• there is a small probability for a de novo (germ-line or somatic) mutation in the child

Page 31: Next-generation sequencing: informatics & software aspects

Alignment visualization

• too much data – indexed browsing• too much detail – color coding, show/hide

Page 32: Next-generation sequencing: informatics & software aspects

Standard data formats

SRF/FASTQ

SAM/BAM

GLF/VCF

Page 33: Next-generation sequencing: informatics & software aspects

Human genome polymorphism projects

common SNPs

Page 34: Next-generation sequencing: informatics & software aspects

Human genome polymorphism discovery

Page 35: Next-generation sequencing: informatics & software aspects

The 1000 Genomes Project

Pilot 1

Pilot 2

Pilot 3

Page 36: Next-generation sequencing: informatics & software aspects

1000G Pilot 3 – exon sequencing

• Targets:1K genes / 10K targets

• Capture: Solid / liquid phase

• Sequencing:454 / IlluminaSE / PE

• Data producers:BaylorBroadSangerWash. U.

• Informatics methods:Multiple read mapping &SNP calling programs

Page 37: Next-generation sequencing: informatics & software aspects

Coverage varies

Page 38: Next-generation sequencing: informatics & software aspects

On/off target capture

ref allele*:45%

non-ref allele*: 54%Target region

SNP(outside target region)

Page 39: Next-generation sequencing: informatics & software aspects

Fragment duplication – revisited

Page 40: Next-generation sequencing: informatics & software aspects

Reference allele bias

(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346

ref allele*:54%

non-ref allele*: 45%

Page 41: Next-generation sequencing: informatics & software aspects

SNP calling findings

BCM/454 BI/SLX WUGSC/SLX SC/SLX# Samples 32 23 16 11

<read depth> per sample 35 X 62 X 117 X 51 X

# SNPs called 7,200 – 8,400 4,500 – 4,700 3,700 3,500 – 3,700

% dbSNPs 39 - 55 65 - 72 68 75 - 85

Ts/Tv(#SNP) 1.7 – 2.6 1.9 – 2.3 2.3 2.5 – 2.6

# Novel SNPs 3,998 1,550 1,947 892

• based on a method comparison / testing exercise

• 80 samples drawn from the 4 Centers

• read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)

Page 42: Next-generation sequencing: informatics & software aspects

Overlap between call sets

45217238.05%1.21

2,2961,86281.10%3.40

413245.81%0.23

Broad callsBC calls

# SNP calls:# dbSNPs:% dbSNPs:Ts/Tv ratio:

Page 43: Next-generation sequencing: informatics & software aspects

The 1000G Structural Variation Discovery Effort

Page 44: Next-generation sequencing: informatics & software aspects

Structural variation detection

Feuk et al. Nature Reviews Genetics, 2006

Page 45: Next-generation sequencing: informatics & software aspects

SV detection – resolution

Expected CNVsKaryotype

Micro-arraySequencing

Rela

tive

num

bers

of e

vent

s

CNV event length [bp]

Page 46: Next-generation sequencing: informatics & software aspects

Detection Approaches

• Read Depth: good for big CNVs

Sample Reference

Lmap

read

contig

• Paired-end: all types of SV

• Split-Readsgood break-point resolution

• deNovo Assembly~ the future

SV slides courtesy of Chip Stewart, Boston College

Page 47: Next-generation sequencing: informatics & software aspects

47

Read depth (RD)

Page 48: Next-generation sequencing: informatics & software aspects

Statistical & systematic biases

Page 49: Next-generation sequencing: informatics & software aspects

Single molecule sequencing?

GC Bias

Coverage bias

Page 50: Next-generation sequencing: informatics & software aspects

RD resolution

dens

ity (l

og10

)

Illum

ina

obse

rved

read

cou

nts (

per k

b)

expected read counts ( per kb)

CN = 2

CN = 1

CN = 3

Page 51: Next-generation sequencing: informatics & software aspects

CNV events detected with RD

Read

s/2k

b lo

g 2(o

/e)

Chromosome 2 Position [Mb]

Read

s/5k

b lo

g 2(o

/e)

Read

s/kb

log 2

(o/e

)

Helicos

40 kb deletion 454

Illumina

individual “NA12878”

Page 52: Next-generation sequencing: informatics & software aspects

SV detection with PE read map positions

Deletion

Individual Reference

LMLdel

Tandemduplication Ldup

InversionLinv

Page 53: Next-generation sequencing: informatics & software aspects

Fragment length distributions

• long fragments ~ better fragment coverage and sensitivity to large events (454)• tighter distributions ~ better breakpoint resolution and sensitivity for shorter events (Illumina)

Page 54: Next-generation sequencing: informatics & software aspects

The SV/CNV event display

chromosomeoverview

fragment lengths

read depth

eventtrack

300 bp deletion in chromosome of NA12878 by Illumina paired-end data from the 1000 Genomes project

Chip Stewart

Page 55: Next-generation sequencing: informatics & software aspects

Deletion event lengths

ALUY

L1

Page 56: Next-generation sequencing: informatics & software aspects

Mobile element insertions: PE reads

• Used with short-read data (Illumina, in our case)• Detect clusters of 5’ & 3’ read pairs with one end mapping

to a mobile element• Clusters far from annotated elements are candidate

insertion events

ALU / L1 / SVA5’ end 3’ end

Page 57: Next-generation sequencing: informatics & software aspects

Mobile element insertions: Split reads

• Requires longer reads (454)• Reads “mapping into” mobile element not present in the

reference genome sequence are candidate insertion events

Page 58: Next-generation sequencing: informatics & software aspects

Mobile element insertions: trio members

356185 182

NA12891684

NA12892664

ALU insertions

NA12878733

47

7996

10

Detection in 1000 G Pilot 2 CEU trio PE Illumina data

Page 59: Next-generation sequencing: informatics & software aspects

Mobile element insertions: PE vs. Split-reads

569247 163454 Split-Read817

SLX PE733

ALU insertions in NA12878

Page 60: Next-generation sequencing: informatics & software aspects

BC event lists in 1000 Genomes data

SV typePilot 1140

samples low coverage

Pilot 26 samples

high coverage

deletions 5,555 4,718tandem

duplications 540 406mobile

element insertions

3,276 2,013

Page 61: Next-generation sequencing: informatics & software aspects

SV calls / validation in 1000G datasets

Page 62: Next-generation sequencing: informatics & software aspects

Software access

http://bioinformatics.bc.edu/marthlab/Software_Release

Page 63: Next-generation sequencing: informatics & software aspects

Credits

Elaine Mardis

Andy Clark

Aravinda Chakravarti

Michael Egholm

Scott Kahn

Francisco de la Vega

Patrice MilosJohn Thompson

R01 HG004719R01 HG003698R21 AI081220RC2 HG5552

Page 64: Next-generation sequencing: informatics & software aspects

Lab

Several positions are available:grad students / postodocs /

programmers

Page 65: Next-generation sequencing: informatics & software aspects

Can we find exon CNVs in Pilot 3 data?

Page 66: Next-generation sequencing: informatics & software aspects

SVs in exon sequencing data