Making the most of short reads
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University

Making the most of short reads torsten seemann - agrf ngs sig - 28 apr 2009

07/08/12 Making the most of short reads 2

Outline

● About the VBC
● Sequencing technologies
● Read mapping
● Applications
● Conclusion
● Questions

What is the VBC ?

● Victorian Bioinformatics Consortium
● 2000-2005
  – Monash .med .infotech, CSIRO, DPI
  – $4M STI grant from State Govt.
● 2005+
  – Dept. Microbiology, Monash Uni.
  – NHMRC/ARC Network Parasitology
  – Micromon (sequencing centre)

Where is the VBC ?

● Monash Uni.
● Clayton Campus
● STRIP2 / Bldg 76
● Level 2
● Microbiology
● Rooms 223-225

VBC capabilities

● Sequence analysis
● Assembly, annotation, SNPs
● Anything-omics!
● Microarray analysis/storage
● Data mining/visualization
● Custom software development
● Computer system architecture

VBC Collaborators

● Monash Uni.
● Uni. Melbourne
● Bio21
● UNSW, Uni. Syd
● UQ : IMB
● MIMR, MMC, Austin
● MISCL

● CSIRO : FSA, LI
● USDA : ARS
● Pasteur Institute
● TIGR
● UCSD
● UCLA
● Uni. Copenhagen

Sanger sequencing

● Dye terminated capillary sequencing
● Read length ~ 300 - 900 bp
● Yield ~ 1 Mbp per day maximum
● Cost ~ $HIGH

Roche 454 FLX+

● Pyro-sequencing
● Read length ~ 100 - 250 bp
● Yield ~ 600 Mbp (250 bp PE)
● Run time ~ 1 day
● Prep time ~ 5 days
● Homo-polymer run errors
● Cost $MEDIUM

ABI SOLiD 3

● Sequencing by ligation
● Read length ~ 35 - 50 bp
● Yield ~ 15,000 Mbp (50 bp PE)
● Run time ~ 14 days
● Prep time ~ ? days
● Colour space error propagation
● Cost $MEDIUM

Illumina GA2 (Solexa)

● Sequencing by synthesis
● Read length ~ 36 - 100 bp
● Yield ~ 6,000 Mbp (36 bp PE)
● Run time ~ 5 days
● Prep time ~ 1 day
● No homo-polymer errors
● Cost $LOW

Illumina output 36bp

Good read

@HWUSI-EAS100R:3:1:5:1526#0/1
TCCCTTGCATTACTCTTAATCGAGGAAATCCCTTTG
+HWUSI-EAS100R:3:1:5:1526#0/1
abbaaaaaaaaaaaaaaaaa_X^WT]a```a_a\`\

Bad read

@HWUSI-EAS100R:3:1:3:1073#0/2
TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT
+HWUSI-EAS100R:3:1:3:1073#0/2
a\DDDDDD^[K]BBBBBBBBBBBBBBBBBBBBBBBB

Quality key

'a' = Q33, Pr(wrong) = 0.0005
'B' = Q2, Pr(wrong) = 0.38
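
The quality key above can be reproduced with a short sketch. It assumes an ASCII offset of 64 for the quality characters, which is consistent with 'B' = Q2 and 'a' = Q33 on the slide; the function names are illustrative only. Two conversions are shown because the slide's 0.38 for Q2 matches the odds-based Solexa formula, whereas the Phred formula would give about 0.63. At Q33 the two agree to four decimal places.

```python
def quality_score(char, offset=64):
    """Convert one FASTQ quality character to an integer score.

    The offset of 64 is an assumption that fits the values on the slide.
    """
    return ord(char) - offset


def phred_error(q):
    """Phred scale: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_error(q):
    """Solexa scale: Q = -10 * log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


for ch in "Ba":
    q = quality_score(ch)
    print(ch, q, round(phred_error(q), 4), round(solexa_error(q), 4))
```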

Read mapping

● Align 10^8 36 bp reads to 5 Mbp reference
● Traditional tools too slow
● New crop of “short read aligners” (SRA)
  – SHRiMP
  – MAQ
  – Bowtie
  – ELAND
  – Novocraft

SRA capabilities

● SNP = Single nucleotide polymorphism
  – Substitution, eg. A → C
  – Insertion or deletion (“indel”), eg. A → -
● Warning: not all aligners support indels!
● We tend to use SHRiMP
  – Supports substitutions and indels
  – Fast SIMD implementation & parallelizable
  – Full post-hit Smith-Waterman alignment
  – Will identify “most” high scoring hits
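
To illustrate what a post-hit Smith-Waterman step computes, here is a minimal score-only sketch of local alignment with a linear gap penalty. The scoring values are arbitrary assumptions, not SHRiMP's defaults, and this plain-Python loop is nothing like SHRiMP's vectorised implementation; it only shows why substitutions and indels can both be tolerated.

```python
def smith_waterman_score(read, ref, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between read and ref.

    Scoring values are illustrative assumptions; gaps use a linear penalty.
    """
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            # Diagonal move: align read[i-1] against ref[j-1].
            step = match if read[i - 1] == ref[j - 1] else mismatch
            diag = prev[j - 1] + step
            # A cell never drops below zero, which makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A read with one extra base relative to the reference still scores well, because the gap costs less than abandoning the alignment.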

Genome coverage

● Mapped 7 M reads to 4 Mbp genome
● Yellow line is mean coverage (56x)
● Bowl shaped coverage = circular genome
● Could be used to guide scaffolding
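
The mean coverage figure can be sanity-checked with simple arithmetic. The sketch below assumes 36 bp reads, which the slide does not state for this data set. Under that assumption, 7 M reads over a 4 Mbp genome would give 63x if every base mapped, so an observed mean of 56x would correspond to roughly 89% of the sequenced bases being placed. The second function is a hypothetical helper that builds a per-base depth profile from read start positions on a linear reference.

```python
def expected_coverage(n_reads, read_len, genome_len):
    """Mean depth if every read maps exactly once."""
    return n_reads * read_len / genome_len


def depth_profile(starts, read_len, genome_len):
    """Per-base depth from 0-based read start positions (linear reference)."""
    diff = [0] * (genome_len + 1)
    for s in starts:
        diff[s] += 1                                  # read begins covering here
        diff[min(s + read_len, genome_len)] -= 1      # and stops covering here
    depth, running = [], 0
    for d in diff[:genome_len]:
        running += d
        depth.append(running)
    return depth


print(expected_coverage(7_000_000, 36, 4_000_000))
```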

Missing DNA

● Read coverage drops to zero where reference has DNA that the new sequence does not

● LB022 absent
● hemH present
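
Candidate missing regions can be pulled out of a depth profile automatically. This is a minimal sketch, assuming the depth is available as a plain list with one value per reference base; the function name and the `min_len` parameter are illustrative.

```python
def zero_coverage_regions(depth, min_len=1):
    """Return half-open (start, end) intervals where depth is zero."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i                      # a zero-depth stretch begins
        elif d != 0 and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None                   # the stretch has ended
    # Handle a stretch that runs to the end of the reference.
    if start is not None and len(depth) - start >= min_len:
        regions.append((start, len(depth)))
    return regions
```

Raising `min_len` filters out short gaps, which are more likely to be mapping artefacts than genuinely absent genes.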

Repeated DNA

● Coverage increases in repeated areas
● LA_SNP3199 is probably triplicated in this strain (depth 120, average 40)
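
The triplication estimate follows from the ratio of local depth to average depth. A minimal sketch, with an illustrative function name; real data would need a tolerance for noise rather than simple rounding.

```python
def estimated_copies(region_depth, mean_depth):
    """Rough copy-number estimate: depth ratio rounded to the nearest integer."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return round(region_depth / mean_depth)


print(estimated_copies(120, 40))
```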

SNPs

● SNPs appear as dips/pinches in the coverage graph

● LA1299 gene has 4 possible SNPs relative to the reference

● Rest of gene has average coverage

Repairing 454 data

● 454 has “homopolymer” errors
● Loses track if same base > 3 times in row
● Traditional assemblers don't like too many indels or frame shifts
● 454 developed Newbler assembler
● Challenging for hybrid assemblies
● What if we could “repair” our 454 data?
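
To see where homopolymer errors are most likely, one can list the runs of four or more identical bases in a sequence. This is a small sketch with an illustrative function name; the threshold of four comes from the "> 3 times in row" rule of thumb above.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Yield (0-based start, base, length) for runs at least min_len long."""
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            yield pos, base, length
        pos += length
```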

454 Repair Guide

● One sample with 454 and Illumina reads
● Get a read mapper supporting indels
● Align all your Illumina reads to 454 data
● If sufficient un-ambiguous depth
  – correct the 454 sequence!
● Can apply to old closed sequences, 454 contigs, 454 reads etc.
● Find old errors via resequencing

Example repair

>FF6ELPM06G1HYY original 180bp
AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT

Sequence        Pos  Change type       Old  New  Evidence
FF6ELPM06G1HYY  11   insertion-before  -    A    "A"x166
FF6ELPM06G1HYY  61   insertion-before  -    A    "A"x212 "-"x12
FF6ELPM06G1HYY  92   insertion-before  -    A    "A"x368 "-"x1

>FF6ELPM06G1HYY repaired 183bp
AAATCTAAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
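
Applying a set of "insertion-before" changes like those in the example can be sketched as follows. This is not the actual repair tool: the function names are hypothetical, and the acceptance rule (`min_depth`, `min_fraction`) is an assumed stand-in for "sufficient un-ambiguous depth". Positions are taken to be 1-based, as in the example.

```python
def apply_insertions(seq, edits):
    """Apply 'insertion-before' edits given as (1-based position, base) pairs.

    Edits are applied from the highest position down so that earlier
    coordinates remain valid.
    """
    bases = list(seq)
    for pos, base in sorted(edits, reverse=True):
        bases.insert(pos - 1, base)
    return "".join(bases)


def is_supported(evidence, base, min_depth=20, min_fraction=0.9):
    """Hypothetical rule: enough reads overall, and most agree on `base`.

    `evidence` maps each observed call to its read count,
    e.g. {"A": 212, "-": 12}.
    """
    total = sum(evidence.values())
    return total >= min_depth and evidence.get(base, 0) / total >= min_fraction
```

With these assumed thresholds, the evidence `{"A": 212, "-": 12}` from the second row would be accepted, since about 95% of the reads support the inserted base.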

Trimming short reads

● Quality worsens toward 3' end
● Many reads have “N” basecalls
● Variation across flowcell/slide

● Will reduce data size
● Trade quality for depth
● Is it worth it?

Should I trim?

● For 36 bp
  – Results are mixed
  – Usually best NOT to trim
  – Depth will “fix” most errors
● For 75+ bp
  – 3' quality can be very poor
  – Seems best to trim
  – Not all reads need trimming
● More research needed
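
For those who do decide to trim, a simple 3' trimmer might look like the sketch below. The quality threshold of 20 and the ASCII offset of 64 are assumptions to adjust for your own data, and the function name is illustrative. Reads that are already good at the 3' end pass through unchanged.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Trim low-quality bases and 'N' calls from the 3' end of one read.

    min_q and offset are assumed values, not recommendations.
    Returns the trimmed (sequence, quality) pair.
    """
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]
```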

Conclusion

● Short read mapping is a powerful tool for genomic discovery

  – Automated analysis eg. SNPs
  – Visualization eg. depth/coverage graphs
  – Repairing longer read data

● Still need de novo assembly for unmapped reads

Contact me

Web: http://www.vicbioinformatics.com/

[email protected]