PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne

Page 1: PE-Assembler: De novo assembler using short paired-end  reads

PE-Assembler: De novo assembler using short paired-end reads

Pramila Nuwantha Ariyaratne

Page 2: PE-Assembler: De novo assembler using short paired-end  reads

Outline

• Method
  – Read screening
  – Seed building
  – Contig extension
  – Scaffolding
  – Gap filling

• Result

Page 3: PE-Assembler: De novo assembler using short paired-end  reads

Data-sets Used

• Single end reads

• Paired-end reads
  – ReadLength: from 25 bp to 100 bp.
  – Insert size varies from MinSpan to MaxSpan.
  – Most of the information comes from this data set.

Page 4: PE-Assembler: De novo assembler using short paired-end  reads

Overview

• The read screening step selects a set of reads as starting points.

• The seed building step extends these reads using single-end reads until they are longer than MaxSpan. Successfully extended regions are called seeds.

• The contig extension step tries to extend every seed using paired-end reads. The resulting sequences are called contigs.

Page 5: PE-Assembler: De novo assembler using short paired-end  reads

Read screening

• Get all k-mers from all the reads.
  – A k-mer that is expected to occur in the actual genome is called a ‘solid’ k-mer.
  – A k-mer that is expected to occur within a repeat region is called a ‘repeat’ k-mer.

• Repeat region:
  – ACTTTGACACACACACAC……ACACACACGTTGAG
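A minimal sketch of the first bullet, collecting every k-mer from every read together with how often it occurs. The function name `count_kmers` and the toy reads are assumptions made for illustration; the slides do not say how PE-Assembler stores its k-mer counts.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input (hypothetical reads, not from the slides).
counts = count_kmers(["ACGTA", "CGTAC"], k=3)
```

The resulting counts are what a frequency cut-off would later be applied to when deciding whether a k-mer is solid or comes from a repeat.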

Page 6: PE-Assembler: De novo assembler using short paired-end  reads

Read screening

Page 7: PE-Assembler: De novo assembler using short paired-end  reads

Read screening

• A read is a solid read if:
  – All of its k-mers have counts within the two threshold cut-offs.

• Example:
  – Two cut-offs [42, 120] from the previous graph.
  – K = 5
  – Read: ACCGTATA
  – k-mers: ACCGT, CCGTA, CGTAT, GTATA
  – Counts: 100, 70, 90, 140
  – Not a solid read: the count of 140 for GTATA is above the upper cut-off of 120.

Page 8: PE-Assembler: De novo assembler using short paired-end  reads

Read screening

• Example:
  – Two cut-offs [42, 120] from the previous graph.
  – K = 5
  – Read: ACCGTATG
  – k-mers: ACCGT, CCGTA, CGTAT, GTATG
  – Counts: 100, 70, 90, 70
  – A solid read: every count lies within [42, 120].
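The two examples can be checked with a short sketch. The k-mer counts and the cut-offs [42, 120] are the ones given on the slides; the function names and the choice to treat both cut-offs as inclusive are assumptions.

```python
def kmers(read, k):
    """Return the overlapping k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def is_solid_read(read, k, counts, low, high):
    """True if every k-mer count of the read lies in [low, high].

    Assumption: both cut-offs are inclusive, and a k-mer missing from
    `counts` has count 0. A read shorter than k is never solid.
    """
    read_kmers = kmers(read, k)
    return bool(read_kmers) and all(
        low <= counts.get(kmer, 0) <= high for kmer in read_kmers
    )


# k-mer counts taken from the two slide examples.
example_counts = {
    "ACCGT": 100, "CCGTA": 70, "CGTAT": 90, "GTATA": 140, "GTATG": 70,
}

first = is_solid_read("ACCGTATA", 5, example_counts, 42, 120)   # GTATA = 140
second = is_solid_read("ACCGTATG", 5, example_counts, 42, 120)  # all in range
```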

Page 9: PE-Assembler: De novo assembler using short paired-end  reads

Seed Building

• Try to extend the solid read using all overlapping reads.

Page 10: PE-Assembler: De novo assembler using short paired-end  reads

Seed Building

• Because of sequencing errors or small repeats, there may be multiple feasible candidates.

Page 11: PE-Assembler: De novo assembler using short paired-end  reads

Seed Building

• To resolve ambiguities due to sequencing errors, we extend every candidate base up to ReadLength.
  – If only one candidate path reaches the full distance ReadLength, that path is assumed to be the correct extension.

• If no path or more than one path is found, try extending from the other side.
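The look-ahead rule can be sketched as follows. This is a simplified illustration under stated assumptions, not PE-Assembler's actual implementation: reads are plain strings on one strand, overlaps are found by brute-force suffix/prefix matching, and `min_overlap`, the function names, and the toy genome are all hypothetical.

```python
def next_bases(seq, reads, min_overlap):
    """Bases proposed by reads whose prefix overlaps the suffix of seq."""
    bases = set()
    for read in reads:
        for ov in range(min_overlap, len(read)):
            if seq.endswith(read[:ov]):
                bases.add(read[ov])  # first base of the read past the overlap
    return bases


def can_reach(seq, reads, min_overlap, steps):
    """True if some overlap-supported path extends seq by `steps` bases."""
    if steps == 0:
        return True
    return any(can_reach(seq + b, reads, min_overlap, steps - 1)
               for b in next_bases(seq, reads, min_overlap))


def choose_base(seq, reads, min_overlap, read_length):
    """Pick the next base, or None if there is no unique extension."""
    candidates = next_bases(seq, reads, min_overlap)
    if len(candidates) <= 1:
        return next(iter(candidates), None)
    # Ambiguity: keep only candidates whose path reaches read_length bases.
    survivors = [b for b in sorted(candidates)
                 if can_reach(seq + b, reads, min_overlap, read_length - 1)]
    return survivors[0] if len(survivors) == 1 else None


def extend_right(seq, reads, min_overlap, read_length, target_len):
    """Extend seq to the right until target_len or until extension stalls."""
    while len(seq) < target_len:
        base = choose_base(seq, reads, min_overlap, read_length)
        if base is None:
            break  # the caller would try the other side here
        seq += base
    return seq


# Toy data: error-free 6 bp reads tiling a made-up genome, plus one read
# ("GTACGA") carrying a sequencing error that creates a dead-end branch.
genome = "ACGTACCTGATTGCAAGGCT"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
reads.append("GTACGA")
seed = extend_right("ACGTAC", reads, min_overlap=4, read_length=6,
                    target_len=12)
```

In this toy case the erroneous branch dies out after two bases, so only the true path reaches the full look-ahead distance and the extension continues along the genome.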

Page 12: PE-Assembler: De novo assembler using short paired-end  reads

Seed Building

• Finally, when the sequence reaches MaxSpan in length, it is called a seed and is verified.

• Verification: at least one paired-end read must overlap this seed with a span within the expected range [MinSpan, MaxSpan].
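A minimal sketch of this verification, under several assumptions that the slides do not state: mates are matched exactly, the second mate is given as the reverse complement of the seed strand, only the first occurrence of each mate is considered, and the span is measured from the start of the first mate to the end of the second. The name `verify_seed` is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def verify_seed(seed, pairs, min_span, max_span):
    """True if at least one read pair lies on the seed with a valid span."""
    for mate1, mate2 in pairs:
        start = seed.find(mate1)
        mate2_fwd = reverse_complement(mate2)  # bring mate 2 onto seed strand
        pos2 = seed.find(mate2_fwd)
        if start == -1 or pos2 == -1:
            continue  # one of the mates does not occur in the seed
        span = pos2 + len(mate2_fwd) - start
        if min_span <= span <= max_span:
            return True
    return False


# Toy seed and one read pair spanning it end to end (span = 20).
toy_seed = "ACGTACCTGATTGCAAGGCT"
toy_pairs = [("ACGTAC", "AGCCTT")]
ok = verify_seed(toy_seed, toy_pairs, min_span=15, max_span=25)
```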

Page 13: PE-Assembler: De novo assembler using short paired-end  reads

Contig Extension

• This step aims to extend each verified seed into a longer contig using paired-end reads.

• Multiple feasible candidates may arise for three reasons.
  – First, sequencing errors.
  – Second, short tandem repeats. These are handled in the gap filling step.
  – Third, long repeats, which are longer than MaxSpan.

Page 14: PE-Assembler: De novo assembler using short paired-end  reads

Scaffolding

• Find the correct ordering of the resulting set of contigs.

• Gao Song is currently working on it.

Page 15: PE-Assembler: De novo assembler using short paired-end  reads

Gap filling

• The gap filling step assembles the gap region between two adjacent contigs to form a longer contig.

Page 16: PE-Assembler: De novo assembler using short paired-end  reads

Gap filling

Page 17: PE-Assembler: De novo assembler using short paired-end  reads
Page 18: PE-Assembler: De novo assembler using short paired-end  reads
Page 19: PE-Assembler: De novo assembler using short paired-end  reads

Simulated data results.

• Results are compared using:
  – Average length of all contigs.
  – N50 and N90 of contigs. Bigger is better.
  – Coverage.
  – Large misassemblies: accuracy is much more important than the other measures.
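For reference, N50 and N90 can be computed as below. The sketch assumes the common definition relative to the total assembled length (the contig length at which the running sum of lengths, taken from longest to shortest, first reaches 50% or 90% of the total); the slides do not say which variant was used. The function name `nx` and the contig lengths are made up for illustration.

```python
def nx(lengths, percent):
    """Contig length at which the cumulative sum of lengths, sorted from
    longest to shortest, first reaches `percent` of the total length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * percent:
            return length
    return 0  # no contigs


# Hypothetical contig lengths, total 290.
contig_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(contig_lengths, 50)  # 80 + 70 = 150 >= 145
n90 = nx(contig_lengths, 90)  # 80 + 70 + 50 + 40 + 30 = 270 >= 261
```

Because N90 requires more of the assembly to be covered, it is never larger than N50 for the same set of contigs.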

Page 20: PE-Assembler: De novo assembler using short paired-end  reads

Simulated data results.

E. coli
                         200bp + 10kbp                    200bp + 1kbp + 10kbp
                         PA        Allpaths2  Velvet      PA        Allpaths2
Contig statistics
  Contigs (>200bp)       23        31         37          6         44
  Average length (kb)    202.7     155.4      125.6       777.4     107.6
  Maximum length (kb)    2109.4    732.7      1506.8      2492.6    593.7
  Contig N50 size (kb)   883.9     357.1      1413.7      2492.6    362.7
  Contig N90 size (kb)   355.7     92.4       597.4       2146.0    83.2
  Coverage               99.89%    99.83%     99.60%      100.00%   99.85%
Evaluation
  Large misassemblies    0         0          2           0         0
  Segment maps           99.20%    99.27%     95.00%      99.68%    99.18%
Performance
  Execution time (min)   12        215        8           21        227
  Memory usage (GB)      1.3       15.4       2           2.3       29.7

S. pombe
                         200bp + 10kbp                    200bp + 1kbp + 10kbp
                         PA        Allpaths2  Velvet      PA        Allpaths2
Contig statistics
  Contigs (>200bp)       53        190        193         31        164
  Average length (kb)    231.8     65.2       63.7        394.7     75.3
  Maximum length (kb)    3500.9    868.4      1062.8      3519.6    851.0
  Contig N50 size (kb)   1499.4    236.6      602.2       1487.7    226.8
  Contig N90 size (kb)   210.0     65.7       148.6       507.6     76.4
  Coverage               97.56%    98.62%     98.95%      97.78%    98.60%
Evaluation
  Large misassemblies    0         1          1           0         0
  Segment maps           96.02%    96.78%     93.38%      96.42%    96.83%
Performance
  Execution time (min)   76        853        26          101       734
  Memory usage (GB)      3.8       45         5.3         4.5       66

HG18 chr10
                         200bp + 1kbp + 10kbp
                         PA
Contig statistics
  Contigs (>200bp)       3158
  Average length (kb)    39.1
  Maximum length (kb)    514.5
  Contig N50 size (kb)   89.0
  Contig N90 size (kb)   24.3
  Coverage               94.20%
Evaluation
  Large misassemblies    1
  Segment maps           90.48%
Performance
  Execution time (min)   1682
  Memory usage (GB)      16

Page 21: PE-Assembler: De novo assembler using short paired-end  reads

Thank you for your attention.