Blue: Fast, accurate error correction using k-mer consensus and context
CSIRO COMPUTATIONAL INFORMATICS
Paul Greenfield, Denis Bauer
15 October 2013
This work was funded by CSIRO Transformational Biology
Error correction algorithms
• All fundamentally the same
• Find an error within a read (or a part of a read)
– Using fixed-length k-mers, variable-length sub-reads, whole reads
• Find the ‘best’ replacement for the broken part of the read
– k-mer consensus, suffix arrays/trees of sub-reads, alignment to consensus
– Trivial much of the time, but correcting the ‘hard’ cases properly is essential
– Repetitive regions → multiple possible corrections – which one is right?
– Much easier if only fixing substitution errors (and ignoring ins/del errors)
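The "find an error" step for the fixed-length k-mer case can be illustrated with a minimal sketch. It assumes that a k-mer seen fewer than `min_count` times across the read set is suspect; the function names, the threshold, and the toy reads are illustrative assumptions, not taken from any of the tools discussed.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, min_count):
    """Return start offsets of k-mers seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Toy data: five error-free copies plus one read with a substitution at index 7.
true_seq = "ACGTTGCATGGCA"
bad_read = "ACGTTGCTTGGCA"
reads = [true_seq] * 5 + [bad_read]

counts = count_kmers(reads, k=5)
suspect = weak_positions(bad_read, counts, k=5, min_count=2)
print(suspect)  # offsets of the k-mers that overlap the bad base
```

Every k-mer that overlaps the substituted base is rare, so a run of weak k-mers brackets the error. Choosing the replacement is the harder part, especially in repeats.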
Correcting DNA sequence data | Paul Greenfield 3 |
[Figures: two plots comparing Original and Healed data; the recoverable axis labels are "% of k-mers" and "Repetition Depth".]
Blue overview
• Blue does k-mer consensus correction
• Chooses between multiple possible fixes by trialling them in the context of the read
• Recursive exploration of the tree of potential ‘fixed’ reads to find ‘best’ fix
– Tree exploration error-limited to improve efficiency
• Handles both Illumina and 454-like data
• Blue separates consensus from the reads being corrected
• Possible to correct long (454) reads with a much larger set of Illumina k-mers
– Combine 454 read length with depth of Illumina
– Addresses 454 homopolymer problem (different error models)
• Blue has an option to discard poor-looking reads (‘-good’)
• Throw away sequencing artefacts and very broken reads
• Being used internally within CSIRO now
• Moth genome, bacterial & metagenomic projects
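The idea of trialling alternative fixes in the context of the read can be sketched roughly as below. This is a simplified, substitution-only illustration and not Blue's actual implementation: `kmer_table`, `correct`, `min_count`, and `max_fixes` are hypothetical names, and the toy sequences are invented. The k-mer table is built from one set of reads and applied to a different read, mirroring the separation of consensus from the reads being corrected.

```python
from collections import Counter

BASES = "ACGT"

def kmer_table(reads, k):
    """Build the consensus k-mer table from a (possibly separate) read set."""
    table = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            table[r[i:i + k]] += 1
    return table

def correct(read, table, k, min_count=2, max_fixes=2):
    """Return the repaired read, or the read unchanged if no repair fits the budget."""
    def explore(seq, start, fixes_left):
        for i in range(start, len(seq) - k + 1):
            if table[seq[i:i + k]] >= min_count:
                continue                      # k-mer is well supported
            if fixes_left == 0:
                return None                   # error budget spent: prune branch
            # Earlier k-mers are good, so blame the newest base (any base if i == 0).
            positions = range(i, i + k) if i == 0 else [i + k - 1]
            best = None
            for p in positions:
                for b in BASES:
                    if b == seq[p]:
                        continue
                    trial = seq[:p] + b + seq[p + 1:]
                    if table[trial[i:i + k]] < min_count:
                        continue              # this fix does not repair the k-mer
                    result = explore(trial, i + 1, fixes_left - 1)
                    if result is not None and (best is None or result[0] < best[0]):
                        best = result         # prefer the fix needing fewest later repairs
            return None if best is None else (best[0] + 1, best[1])
        return (0, seq)                       # reached the end with all k-mers supported

    result = explore(read, 0, max_fixes)
    return result[1] if result else read

# Two consensus sequences share "GCT" but continue differently (a toy repeat).
consensus = ["AAGCTTACGG"] * 3 + ["CCGCTGTTCA"] * 3
table = kmer_table(consensus, k=4)
fixed = correct("AAGCTCACGG", table, k=4)     # substitution at index 5
print(fixed)
```

In this toy case both "GCTT" and "GCTG" repair the weak k-mer, but only the first leaves the rest of the read supported by the table, so the downstream context decides between them. The `max_fixes` budget plays the role of the error limit on tree exploration.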
Algorithms under test
Tool      FASTA In      FASTQ In   Fmt in=out   Ins, Dels?   Ns fixed?   Pairs kept?   Multi-Threads
Blue      Yes           Yes        Yes          Yes          Yes         Yes           Yes
Coral     Yes           Yes        Yes          Yes          Yes         Yes           Yes
Echo      No            Yes        Yes          No           No          No            Yes
HiTEC     Yes           Yes        No           No           No          No            No
HSHREC    Single-line   No         No           Yes          No          No            Yes
Quake     No            Yes        Yes          No           Yes         Yes           Yes
Reptile   Yes           No         No           No           Yes         Yes           No
SHREC     Single-line   Yes        Yes          No           No          No            Yes
Performance tests
ERA000206
Tool      Elapsed (mins)   Processor (mins)   Memory used (GB)   Threads   RPM (elapsed)
Blue      52               203                0.6                4         547,758
Coral     1,596            12,054             39.0               8         17,817
Echo      Ran but did not complete
HiTEC     699              699                13.4               1         40,670
HSHREC    808              5,790              30.5               8         35,184
Quake     Failed
Reptile   320              320                4.3                1         88,766
SHREC     465              1,994              33.0               4         61,080

ERR022075
Tool      Elapsed (mins)   Processor (mins)   Memory used (GB)   Threads   RPM (elapsed)
Blue      36               139                0.6                4         1,280,006
Coral     2,752            21,004             47.0               8         16,511
Echo      Ran but did not complete
HiTEC     1,365            1,365              11.0               1         33,290
HSHREC    1,363            9,586              (not given)        8         33,351
Quake     Failed
Reptile   509              509                2.8                1         89,297
SHREC     625              2,405              27.5               4         72,681
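The columns of the performance table can be related to each other with a little arithmetic. The sketch below assumes that "RPM (elapsed)" means reads processed per minute of elapsed time, which is an interpretation rather than something the slides state. The helper names are illustrative, and the figures are the Blue and Coral rows for ERA000206.

```python
def reads_processed(rpm, elapsed_mins):
    """Approximate read count implied by a throughput figure (assumes RPM = reads/min)."""
    return rpm * elapsed_mins

def cpu_utilisation(processor_mins, elapsed_mins, threads):
    """Average fraction of the available threads kept busy."""
    return processor_mins / (elapsed_mins * threads)

blue = {"elapsed": 52, "processor": 203, "threads": 4, "rpm": 547_758}
coral = {"elapsed": 1_596, "processor": 12_054, "threads": 8, "rpm": 17_817}

for name, row in (("Blue", blue), ("Coral", coral)):
    n = reads_processed(row["rpm"], row["elapsed"])
    u = cpu_utilisation(row["processor"], row["elapsed"], row["threads"])
    print(f"{name}: ~{n / 1e6:.1f}M reads, {u:.0%} thread utilisation")

speedup = coral["elapsed"] / blue["elapsed"]
print(f"Blue finished about {speedup:.1f}x sooner than Coral on this dataset")
```

Under that assumption both rows imply roughly 28 million reads, in line with the 28M Illumina dataset described in the accuracy tests, which is what suggests the reads-per-minute reading of the column.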
Testing accuracy and effectiveness
• Downloaded E. coli K12 MG1655 datasets from SRA
• Two Illumina datasets (28M & 45M paired 100-bp reads)
• Two 454 datasets (350K & 144K reads)
• Accuracy (using Bowtie & Bowtie2)
• Aligned corrected reads to E. coli K12 MG1655 reference sequence
– More reads aligned with 0 mismatches → more accurate correction
– Expect some genetic drift and some sequencing artefacts in practice
• Effectiveness (using Velvet)
• Assembled...
– Raw and corrected Illumina data
– Combined 454 + Illumina data
– Perfect synthetic ‘reads’ used for comparison
• Do you get longer contigs that contain fewer errors?
– Compare contig lengths and error density in contigs
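The accuracy measure above amounts to tallying reads by mismatch count. A minimal sketch, assuming SAM-format text output in which Bowtie2 reports mismatches in its optional `XM:i:` field; the function name, the cap at 10+, and the toy records are illustrative, and the slides do not say how unaligned reads were counted.

```python
from collections import Counter

def mismatch_histogram(sam_lines, cap=10):
    """Tally aligned reads by mismatch count (XM:i tag), lumping cap-or-more together."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue                          # header line
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:              # SAM flag bit 0x4: read unmapped
            hist["unaligned"] += 1
            continue
        xm = next((int(f[5:]) for f in fields[11:] if f.startswith("XM:i:")), None)
        if xm is not None:
            hist[min(xm, cap)] += 1
    return hist

# Toy records with the 11 mandatory SAM fields plus an optional XM:i tag.
sam = [
    "@HD\tVN:1.0",
    "r1\t0\tref\t1\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tXM:i:0",
    "r2\t0\tref\t9\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tXM:i:0",
    "r3\t16\tref\t20\t30\t5M\t*\t0\t0\tACGTA\tIIIII\tXM:i:12",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
hist = mismatch_histogram(sam)
aligned = sum(v for key, v in hist.items() if key != "unaligned")
print(f"{hist[0] / aligned:.0%} of aligned reads have 0 mismatches")
```

Running the same tally on raw and corrected reads gives a like-for-like comparison: a larger 0-mismatch share after correction points to more accurate correction.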
Accuracy test (Illumina 28M)
[Figure: ERA000206 Bowtie2 alignment mismatches. Stacked bars showing the share of reads aligning with 0, 1, 2, ... 9 and 10+ mismatches. Bars, in order: Original, Blue, Reptile, Shrec, Coral, HiTEC, HSHREC. Data labels, in the order extracted: 57.0%, 98.8%, 79.5%, 84.4%, 62.9%, 88.8%, 5.0%.]
Accuracy test (454)
(using an indel-capable aligner)
[Figure: SRR029323 Bowtie2 alignment mismatches. Stacked bars showing the share of reads aligning with 0, 1, 2, ... 9 and 10+ mismatches. Bars, in order: Original, Blue 454, Blue 454x2, Blue 22075, Blue 22075x2, Coral, HSHREC. Data labels, in the order extracted: 1%, 51%, 76%, 65%, 95%, 41%, 10%.]
Assembly: contig lengths
[Figure: Velvet contigs - ERA000206 - k=41. Series: Contigs max, Contigs N50, Contigs N90. Axes: Contig lengths (0 to 250,000) and CDS breakages (0 to 250). Categories: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec.]
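For readers unfamiliar with the N50 and N90 statistics plotted here, the sketch below uses the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches 50% (or 90%) of the total assembled length. The function name and the toy contig lengths are illustrative and not taken from the talk.

```python
def nxx(lengths, percent):
    """Contig length at which the sorted running total reaches `percent` of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:   # integer arithmetic avoids float rounding
            return length
    return 0                                    # empty assembly

contigs = [100, 80, 60, 40, 20]                 # toy contig lengths, total 300
print("max:", max(contigs), "N50:", nxx(contigs, 50), "N90:", nxx(contigs, 90))
```

A higher N50 or N90 after correction means the assembly is less fragmented, which is why the charts pair these values with counts of broken coding sequences.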
Assembly: contig lengths with miscalls
[Figure: Velvet contigs - ERA000206 - k=41. Series: Contigs max, Contigs N50, Contigs N90, Broken CDS. Axes: Contig lengths (0 to 250,000) and CDS breakages (0 to 250). Categories: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec.]
Assembly: 454+Illumina
[Figure: Velvet contigs - ERA000206 - k=41. Series: MaxContigLength, ContigN50, ContigN90, BrokenCDS. Axis ranges: 0 to 250,000 and 0 to 600. Categories: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec, Blue454+206, Coral454+206, 454+206, 454+206Hi.]
Assembly accuracy - miscalls
[Figure: Miscalls - ERA000206 Contigs vk=41, plotted along genome positions 0 to 5,000,000. Tracks, with miscall counts in parentheses:]
Original (385)
Blue (38)
BlueGood (32)
Synth (36)
HiTEC (50)
Coral (355)
Reptile (171)
SHREC (107)
454 + Illumina (493)
454Blue + IlluminaBlue (16)
454 + IlluminaHiTEC (55)
454Coral + IlluminaCoral (428)
Density of miscalls (including real differences and alignment artefacts) along MG1655 genome. Data generated by Mauve assembler-testing tools (Aaron Darling).
rhsD alignments (vk=41)
[Figure: rhsD alignments, one track per dataset: Synth, Blue 454+Illumina, Blue, Blue Good, Raw, HiTEC, Coral, Reptile, SHREC. Five "2x" annotations appear on the figure.]
950 bp region repeated in pseudogene (with somewhat divergent margins)
Summing up
• Correction can significantly improve alignment & assembly results
• Most published algorithms are not very effective
– Transparent component in a processing pipeline
– Fast enough and scalable enough to handle real datasets
– And... improve results enough to be worthwhile
• Blue...
• Uses the context of an error to decide between alternative fixes
– Recursive search of the tree of potential ‘repaired’ reads to find the ‘best’
• Separates reads from consensus
– Allowing cross-correction between different types of data
• Testing showed Blue to be the most accurate and fastest
• Available from www.bioinformatics.csiro.au/Blue
Bioinformatics and Biostatistics Paul Greenfield Research Group Leader
t +61 2 9325 3250 e [email protected] w www.csiro.au/CCI
CSIRO MATHEMATICS, INFORMATICS AND STATISTICS
Thank you
Paul Greenfield (CCI)
Denis Bauer (CCI)
Konsta Duesing (CAFHS)
Alexie Papanicolaou (CES)
Supported by David Lovell & CSIRO Transformational Biology