Blue - fast, accurate error correction using k-mer consensus and context - Paul Greenfield, Denis Bauer

Blue: fast, accurate error correction using k-mer consensus and context

CSIRO COMPUTATIONAL INFORMATICS

Paul Greenfield, Denis Bauer

15 October 2013

This work was funded by CSIRO Transformational Biology

Error correction algorithms

• All fundamentally the same

• Find an error within a read (or a part of a read)

– Using fixed-length k-mers, variable length sub-reads, whole reads

• Find the ‘best’ replacement for the broken part of the read

– k-mer consensus, suffix arrays/trees of sub-reads, alignment to consensus

– Trivial much of the time, but correcting the ‘hard’ cases properly is essential

– Repetitive regions give multiple possible corrections: which one is right?

– Much easier if only fixing substitution errors (and ignoring ins/del errors)
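As an illustration of the "find an error" step with fixed-length k-mers, here is a minimal Python sketch. It is not Blue's code: the function names, the toy genome, and the `min_depth` threshold are assumptions made for this example, and only the forward strand is counted. K-mers that occur fewer than `min_depth` times in the read set are treated as suspect.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, for brevity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, min_depth):
    """Return the start offsets of k-mers seen fewer than min_depth times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_depth]

# Toy data (hypothetical): error-free 12-base reads tiled over a short genome,
# each repeated three times to give every k-mer some depth.
genome = "ACGTTGCATGGCTAAGCTTGACCGATAGGCT"
reads = [genome[i:i + 12] for i in range(len(genome) - 11)] * 3
table = count_kmers(reads, k=5)

# genome[5:17] with a single A->T substitution at read offset 8.
bad_read = "GCATGGCTTAGC"
print(weak_positions(bad_read, table, k=5, min_depth=2))
```

Every k-mer that overlaps the substituted base falls below the threshold, so the run of weak offsets brackets the error. Choosing the replacement is the harder step.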


[Figure: k-mer repetition depth profiles for the Original and Healed reads. y-axis: % of k-mers; x-axis: Repetition Depth (0 to 500)]

Blue overview

• Blue does k-mer consensus correction

• Chooses between multiple possible fixes by trialling them in the context of the read

• Recursive exploration of the tree of potential ‘fixed’ reads to find ‘best’ fix

– Tree exploration error-limited to improve efficiency

• Handles both Illumina and 454-like data

• Blue separates consensus from the reads being corrected

• Possible to correct long (454) reads with a much larger set of Illumina k-mers

– Combine 454 read length with depth of Illumina

– Addresses 454 homopolymer problem (different error models)

• Blue has an option to discard poor-looking reads (‘-good’)

• Throw away sequencing artefacts and very broken reads

• Being used internally within CSIRO now

• Moth genome, bacterial & metagenomic projects
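The recursive, error-limited exploration described in this overview can be sketched as follows. This is a simplified illustration under stated assumptions, not Blue's implementation: it handles substitutions only, the names `correct`, `explore`, `min_depth` and `max_fixes` are invented for the example, and "best" is taken to mean "fewest substitutions that leave every k-mer in the read well supported". The consensus table is passed in separately from the read, so it could equally be built from a different (for example deeper) read set.

```python
from collections import Counter

def count_kmers(reads, k):
    """Build the consensus table: k-mer -> number of occurrences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct(read, counts, k, min_depth, max_fixes=2):
    """Return (fixed_read, fixes_used), or None if no fix fits the budget."""

    def solid(kmer):
        return counts.get(kmer, 0) >= min_depth

    def explore(seq, pos, fixes_left):
        last_start = len(seq) - k
        # Skip k-mers that the consensus already supports.
        while pos <= last_start and solid(seq[pos:pos + k]):
            pos += 1
        if pos > last_start:
            return seq, 0              # whole read is consistent
        if fixes_left == 0:
            return None                # error limit reached: prune this branch
        # Earlier k-mers were solid, so the newest base is the suspect;
        # at the very start of the read any base of the first k-mer may be wrong.
        targets = range(k) if pos == 0 else [pos + k - 1]
        best = None
        for j in targets:
            for base in "ACGT":
                if base == seq[j]:
                    continue
                trial = seq[:j] + base + seq[j + 1:]
                if not solid(trial[pos:pos + k]):
                    continue
                # Trial the candidate in the context of the rest of the read.
                outcome = explore(trial, pos + 1, fixes_left - 1)
                if outcome and (best is None or outcome[1] + 1 < best[1]):
                    best = (outcome[0], outcome[1] + 1)
        return best

    return explore(read, 0, max_fixes)

# Toy consensus (hypothetical data): error-free reads tiled over a short genome.
genome = "ACGTTGCATGGCTAAGCTTGACCGATAGGCT"
consensus = count_kmers([genome[i:i + 12] for i in range(len(genome) - 11)] * 3, k=5)

# genome[5:17] with one A->T substitution at read offset 8.
print(correct("GCATGGCTTAGC", consensus, k=5, min_depth=2))
```

With this toy consensus the call restores the original genome substring using one substitution. Setting `max_fixes=0` makes the same call return `None`, which is the pruning that keeps the tree search bounded.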


Algorithms under test


Tool     FASTA in     FASTQ in  Fmt in=out  Ins, dels?  Ns fixed?  Pairs kept?  Multi-threads
Blue     Yes          Yes       Yes         Yes         Yes        Yes          Yes
Coral    Yes          Yes       Yes         Yes         Yes        Yes          Yes
Echo     No           Yes       Yes         No          No         No           Yes
HiTEC    Yes          Yes       No          No          No         No           No
HSHREC   Single-line  No        No          Yes         No         No           Yes
Quake    No           Yes       Yes         No          Yes        Yes          Yes
Reptile  Yes          No        No          No          Yes        Yes          No
SHREC    Single-line  Yes       Yes         No          No         No           Yes

Performance tests


ERA000206

Tool     Elapsed (mins)  Processor (mins)  Memory used (GB)  Threads  RPM (elapsed)
Blue     52              203               0.6               4        547,758
Coral    1,596           12,054            39.0              8        17,817
Echo     Ran but did not complete
HiTEC    699             699               13.4              1        40,670
HSHREC   808             5,790             30.5              8        35,184
Quake    Failed
Reptile  320             320               4.3               1        88,766
SHREC    465             1,994             33.0              4        61,080

ERR022075

Tool     Elapsed (mins)  Processor (mins)  Memory used (GB)  Threads  RPM (elapsed)
Blue     36              139               0.6               4        1,280,006
Coral    2,752           21,004            47.0              8        16,511
Echo     Ran but did not complete
HiTEC    1,365           1,365             11.0              1        33,290
HSHREC   1,363           9,586             -                 8        33,351
Quake    Failed
Reptile  509             509               2.8               1        89,297
SHREC    625             2,405             27.5              4        72,681
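To make the throughput column easier to read, the short Python sketch below recomputes two derived quantities from the ERA000206 figures in the table above. It assumes that "RPM (elapsed)" means reads processed per minute of elapsed time; the slides do not spell this out. The helper names are invented for the example.

```python
# (elapsed minutes, RPM) for ERA000206, copied from the table above.
era000206 = {
    "Blue": (52, 547_758),
    "Coral": (1_596, 17_817),
    "HiTEC": (699, 40_670),
    "HSHREC": (808, 35_184),
    "Reptile": (320, 88_766),
    "SHREC": (465, 61_080),
}

def implied_reads(elapsed_mins, rpm):
    """Reads processed, if RPM is reads per elapsed minute (an assumption)."""
    return elapsed_mins * rpm

def slowdown_vs(tool, baseline, runs):
    """Elapsed time of `tool` as a multiple of the baseline's elapsed time."""
    return runs[tool][0] / runs[baseline][0]

for tool, (mins, rpm) in era000206.items():
    reads_m = implied_reads(mins, rpm) / 1e6
    ratio = slowdown_vs(tool, "Blue", era000206)
    print(f"{tool:8s} {reads_m:6.1f}M reads  {ratio:5.1f}x Blue's elapsed time")
```

Under that reading, every row implies roughly 28.4 to 28.5 million reads, so the rows are directly comparable and the elapsed-time ratios are a fair summary of relative speed on this dataset.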

Testing accuracy and effectiveness

• Downloaded E. coli K12 MG1655 datasets from SRA

• Two Illumina datasets (28M & 45M paired 100-bp reads)

• Two 454 datasets (350K & 144K reads)

• Accuracy (using Bowtie & Bowtie2)

• Aligned corrected reads to E. coli K12 MG1655 reference sequence

– More reads aligned with 0 mismatches means more accurate correction

– Expect some genetic drift and some sequencing artefacts in practice

• Effectiveness (using Velvet)

• Assembled...

– Raw and corrected Illumina data

– Combined 454 + Illumina data

– Perfect synthetic ‘reads’ used for comparison

• Do you get longer contigs that contain fewer errors?

– Compare contig lengths and error density in contigs
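The accuracy measure above amounts to a histogram of mismatches per aligned read. A minimal Python sketch of that tally is given below. It assumes SAM output in which each aligned record carries Bowtie2's `XM:i:` optional field (number of mismatches in the alignment); the sample records are made up for the example and are not taken from the test datasets.

```python
from collections import Counter

def mismatch_histogram(sam_lines, cap=10):
    """Tally reads by mismatch count, pooling everything >= cap into one bin."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                    # SAM flag bit 0x4: read unmapped
            hist["unaligned"] += 1
            continue
        if flag & 0x900:                  # skip secondary/supplementary records
            continue
        xm = next((int(f[5:]) for f in fields[11:] if f.startswith("XM:i:")), None)
        if xm is None:
            hist["no XM tag"] += 1
        else:
            hist[min(xm, cap)] += 1
    return hist

def sam_record(name, flag, tags):
    """Build a minimal, made-up SAM record for the demonstration."""
    core = [name, str(flag), "ref", "1", "42", "10M", "*", "0", "0",
            "ACGTACGTAC", "IIIIIIIIII"]
    return "\t".join(core + tags)

sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    sam_record("r1", 0, ["AS:i:0", "XM:i:0"]),
    sam_record("r2", 16, ["AS:i:0", "XM:i:0"]),
    sam_record("r3", 0, ["AS:i:-6", "XM:i:1"]),
    sam_record("r4", 0, ["AS:i:-60", "XM:i:12"]),
    sam_record("r5", 4, ["YT:Z:UU"]),
]

hist = mismatch_histogram(sample)
aligned = sum(v for key, v in hist.items() if isinstance(key, int))
print(hist, f"{hist[0] / aligned:.0%} of aligned reads have 0 mismatches")
```

The fraction of reads in the 0-mismatch bin is the headline figure. The `cap` parameter mirrors a "10+ mismatches" category.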


Accuracy test (Illumina 28M)


[Figure: ERA000206 Bowtie2 alignment mismatches. Stacked bars, one per read set, split into reads aligned with 0, 1, 2, ... 9 and 10+ mismatches; y-axis 0% to 100%]

Bar labels as extracted, paired with the read sets in order: Original 57.0%, Blue 98.8%, Reptile 79.5%, Shrec 84.4%, Coral 62.9%, HiTEC 88.8%, HSHREC 5.0%

Accuracy test (454)


(using an indel-capable aligner)

[Figure: SRR029323 Bowtie2 alignment mismatches. Stacked bars, one per read set, split into reads aligned with 0, 1, 2, ... 9 and 10+ mismatches; y-axis 0% to 100%]

Bar labels as extracted, paired with the read sets in order: Original 1%, Blue 454 51%, Blue 454x2 76%, Blue 220275 65%, Blue 22075x2 95%, Coral 41%, HSHREC 10%

Assembly: contig lengths


[Figure: Velvet contigs - ERA000206 - k=41. Series: Contigs max, Contigs N50, Contigs N90. Axes: Contig lengths (0 to 250,000) and CDS breakages (0 to 250). Read sets: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec]
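For readers unfamiliar with the contig statistics plotted here, N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches 50% of the assembly; N90 is the same with 90%. A minimal Python sketch follows; the function name and the contig lengths are invented for the example.

```python
def nxx(contig_lengths, percent):
    """Return the Nxx statistic (percent=50 gives N50, percent=90 gives N90)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Hypothetical contig lengths, not taken from the assemblies in the slides.
contigs = [100, 80, 60, 40, 20]
print(max(contigs), nxx(contigs, 50), nxx(contigs, 90))
```

A larger N50 or N90 means more of the assembly sits in long contigs, which is why these values are compared across the corrected read sets.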

Assembly: contig lengths with miscalls


[Figure: Velvet contig lengths with broken CDS counts. Series: Contigs max, Contigs N50, Contigs N90, Broken CDS. Axes: Contig lengths (0 to 250,000) and CDS breakages (0 to 250). Read sets: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec]

Assembly: 454+Illumina


[Figure: contig lengths and broken CDS counts, including combined 454 + Illumina assemblies. Series: MaxContigLength, ContigN50, ContigN90, BrokenCDS. Axes: 0 to 250,000 and 0 to 600. Read sets: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec, Blue454+206, Coral454+206, 454+206, 454+206Hi]

[Figure: Miscalls - ERA000206 Contigs vk=41. One track per assembly; x-axis 0 to 5,000,000. Track labels, with the figure in parentheses as shown on the slide:]

Original (385)

Blue (38)

BlueGood (32)

Synth (36)

HiTEC (50)

Coral (355)

Reptile (171)

SHREC (107)

454 + Illumina (493)

454Blue + IlluminaBlue (16)

454 + IlluminaHiTEC (55)

454Coral + IlluminaCoral (428)

Assembly accuracy - miscalls


Density of miscalls (including real differences and alignment artefacts) along MG1655 genome. Data generated by Mauve assembler-testing tools (Aaron Darling)
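A density plot of this kind can be produced by counting miscall positions in fixed-width windows along the reference. The Python sketch below shows only that binning step; the function name, window size and positions are assumptions for illustration, and it does not reproduce the Mauve tools' output format.

```python
def miscall_density(positions, genome_length, window):
    """Count miscalls per window of `window` bases along the reference."""
    n_bins = (genome_length + window - 1) // window   # ceiling division
    bins = [0] * n_bins
    for pos in positions:
        if not 0 <= pos < genome_length:
            raise ValueError(f"position {pos} is outside the reference")
        bins[pos // window] += 1
    return bins

# Hypothetical 0-based miscall positions on a 100-base reference.
print(miscall_density([5, 15, 17, 99], genome_length=100, window=10))
```

Windows with unusually high counts point at regions such as repeats, where correction and assembly are hardest.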

rhsD alignments (vk=41)


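[Figure: alignments of the rhsD region, one track per assembly: Synth, Blue 454+Illumina, Blue, Blue Good, Raw, HiTEC, Coral, Reptile, SHREC. Five of the tracks carry a "2x" annotation]

950 bp region repeated in pseudogene (with somewhat divergent margins)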

Summing up

• Correction can significantly improve alignment & assembly results

• Most published algorithms are not very effective

– Transparent component in a processing pipeline

– Fast enough and scalable enough to handle real datasets

– And... improve results enough to be worthwhile

• Blue...

• Uses the context of an error to decide between alternative fixes

– Recursive search of the tree of potential ‘repaired’ reads to find the ‘best’

• Separates reads from consensus

– Allowing cross-correction between different types of data

• Testing showed Blue to be the most accurate and fastest

• Available from www.bioinformatics.csiro.au/Blue



Bioinformatics and Biostatistics

Paul Greenfield, Research Group Leader

t +61 2 9325 3250

e [email protected]

w www.csiro.au/CCI

CSIRO MATHEMATICS, INFORMATICS AND STATISTICS

Thank you

Paul Greenfield (CCI)

Denis Bauer (CCI)

Konsta Duesing (CAFHS)

Alexie Papanicolaou (CES)

Supported by David Lovell & CSIRO Transformational Biology

Technology
