PE-Assembler: De novo assembler using short paired-end reads

Pramila Nuwantha Ariyaratne

Outline

• Method
– Read screening
– Seed building
– Contig extension
– Scaffolding
– Gap filling

• Result

Datasets Used

• Single-end reads

• Paired-end reads
– ReadLength ranges from 25bp to 100bp.
– Insert size varies from MinSpan to MaxSpan.
– Most of the information comes from this dataset.

Overview

• The read screening step selects a set of reads as starting points.

• The seed building step extends these reads using single-end reads until they are longer than MaxSpan. Successfully extended regions are called seeds.

• The contig extension step tries to extend every seed using paired-end reads. The resulting sequences are called contigs.

Read screening

• Get all k-mers from all the reads.
– A k-mer that is expected to occur in the actual genome is called a ‘solid’ k-mer.
– A k-mer that is expected to occur within a repeat region is called a ‘repeat’ k-mer.

• Repeat region:
– ACTTTGACACACACACAC……ACACACACGTTGAG
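The k-mer collection step can be sketched as below. This is a minimal illustration, not the PE-Assembler implementation: the function names `kmers`, `count_kmers` and `classify_kmer` are hypothetical, and labelling k-mers below the lower cut-off as "low-frequency" is an assumption, since the slides only define solid and repeat k-mers.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def classify_kmer(count, low, high):
    """Label a k-mer by its frequency relative to two cut-offs.

    Assumption: counts above `high` suggest a repeat region and
    counts below `low` are too rare to trust.
    """
    if count > high:
        return "repeat"
    if count < low:
        return "low-frequency"
    return "solid"


print(kmers("ACCGTATA", 5))  # ['ACCGT', 'CCGTA', 'CGTAT', 'GTATA']
```

A real assembler would typically also count reverse-complement k-mers; that is left out here to keep the sketch short.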

Read screening

• A read is a solid read if:
– All its k-mers are within the two threshold cut-offs.

• Example:
– Two cut-offs [42, 120] from the previous graph.
– K=5
– Read: ACCGTATA
– k-mers: ACCGT, CCGTA, CGTAT, GTATA
– Counts: 100, 70, 90, 140
– Not a solid read (140 is above the upper cut-off of 120).

Read screening

• Example:
– Two cut-offs [42, 120] from the previous graph.
– K=5
– Read: ACCGTATG
– k-mers: ACCGT, CCGTA, CGTAT, GTATG
– Counts: 100, 70, 90, 70
– A solid read (every count lies within [42, 120]).
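The two examples above can be checked with a short sketch. The function name `is_solid_read` is hypothetical and the k-mer counts are the ones given on the slides; treating a k-mer with no recorded count as frequency 0 is an assumption.

```python
def kmers(read, k):
    """Return every length-k substring of read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def is_solid_read(read, counts, k, low, high):
    """True if the read has at least one k-mer and every k-mer count is in [low, high]."""
    if len(read) < k:
        return False
    return all(low <= counts.get(kmer, 0) <= high for kmer in kmers(read, k))


# k-mer counts taken from the two slide examples.
counts = {"ACCGT": 100, "CCGTA": 70, "CGTAT": 90, "GTATA": 140, "GTATG": 70}

print(is_solid_read("ACCGTATA", counts, 5, 42, 120))  # False (GTATA occurs 140 times)
print(is_solid_read("ACCGTATG", counts, 5, 42, 120))  # True
```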

Seed Building

• Try to extend the solid read using all overlapping reads.

Seed Building

• Because of sequencing errors or small repeats, there may be multiple feasible candidates.

Seed Building

• To resolve ambiguities caused by sequencing errors, we extend every candidate base up to ReadLength.
– If only one candidate path reaches the full distance ReadLength, that path is assumed to be the correct extension.

• If no path or more than one path is found, try the other side.
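The look-ahead rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the PE-Assembler code: reads are plain strings on the same strand, overlaps must be exact, and the names `candidate_next_bases`, `can_reach`, `choose_next_base`, `min_overlap` and `lookahead` are all hypothetical. A candidate base is kept only if some path starting with it can be extended for the full look-ahead distance (ReadLength on the slides).

```python
def candidate_next_bases(seq, reads, min_overlap):
    """Bases proposed by reads whose prefix overlaps the 3' end of seq."""
    candidates = set()
    for read in reads:
        longest = min(len(read) - 1, len(seq))
        for ov in range(longest, min_overlap - 1, -1):
            if seq.endswith(read[:ov]):
                candidates.add(read[ov])
    return candidates


def can_reach(seq, reads, min_overlap, depth):
    """True if seq can be extended by `depth` more bases along some path."""
    if depth == 0:
        return True
    return any(
        can_reach(seq + base, reads, min_overlap, depth - 1)
        for base in candidate_next_bases(seq, reads, min_overlap)
    )


def choose_next_base(seq, reads, min_overlap, lookahead):
    """Return the single surviving candidate base, or None if zero or several survive."""
    surviving = [
        base
        for base in sorted(candidate_next_bases(seq, reads, min_overlap))
        if can_reach(seq + base, reads, min_overlap, lookahead - 1)
    ]
    return surviving[0] if len(surviving) == 1 else None


# GTACTT carries a simulated sequencing error and has no supporting reads.
reads = ["GTACGG", "TACGGA", "ACGGAT", "GTACTT"]
print(choose_next_base("ACGTAC", reads, 3, 3))  # G
```

When `choose_next_base` returns None, the caller would stop and try extending from the other side. The recursion is exponential in the worst case, which is acceptable for a sketch but not for real data.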

Seed Building

• Finally, when the sequence reaches MaxSpan in length (it is now called a seed), do a verification.

• At least one paired-end read must overlap this seed within the expected length [MinSpan, MaxSpan].
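One possible reading of the verification step is sketched below. The assumptions are significant: both mates are given in the seed's orientation, matching is exact, only the first occurrence of each mate is considered, and the span is measured from the start of the left mate to the end of the right mate. The function name `seed_is_verified` is hypothetical.

```python
def seed_is_verified(seed, pairs, min_span, max_span):
    """True if at least one read pair maps onto seed with span in [min_span, max_span]."""
    for left, right in pairs:
        start = seed.find(left)
        if start < 0:
            continue  # left mate does not occur in the seed
        right_start = seed.find(right, start)
        if right_start < 0:
            continue  # right mate does not occur downstream of the left mate
        span = right_start + len(right) - start
        if min_span <= span <= max_span:
            return True
    return False


seed = "AAACCCGGGTTTACGT"
pairs = [("AAAC", "TACG")]  # mates map at positions 0 and 11, span 15
print(seed_is_verified(seed, pairs, 10, 16))  # True
```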

Contig Extension

• This step aims to extend each verified seed into a longer contig using paired-end reads.

• Multiple feasible candidates may arise for 3 reasons.
– First, sequencing errors.
– Second, short tandem repeats. These are handled in the Gap filling step.
– Third, long repeats, which are longer than MaxSpan.

Scaffolding

• Find the correct ordering of the resulting set of contigs.

• Gao Song is currently working on it.

Gap filling

• The gap filling step assembles the gap region between two adjacent contigs to form a longer contig.

Gap filling

Simulated data results

• Results are compared using:
– Average length of all contigs.
– N50 and N90 of contigs. Bigger is better.
– Coverage.
– Large misassemblies: accuracy is much more important than the other metrics.
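For reference, N50 and N90 can be computed with the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50% (or 90%) of the total assembled length. The sketch below uses a hypothetical function name `nx` and made-up contig lengths, not values from the results.

```python
def nx(lengths, x):
    """Return Nx for a list of contig lengths, with x given in percent (e.g. 50 or 90)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return 0  # empty input


contig_lengths = [80, 70, 50, 40, 30, 20]  # illustrative values only
print(nx(contig_lengths, 50))  # 70
print(nx(contig_lengths, 90))  # 30
```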

Simulated data results

E. coli
                          200bp + 10kbp                    200bp + 1kbp + 10kbp
                          PA         Allpaths2  Velvet     PA         Allpaths2
Contig statistics
Contigs (>200bp)          23         31         37         6          44
Average length (kb)       202.7      155.4      125.6      777.4      107.6
Maximum length (kb)       2109.4     732.7      1506.8     2492.6     593.7
Contig N50 size (kb)      883.9      357.1      1413.7     2492.6     362.7
Contig N90 size (kb)      355.7      92.4       597.4      2146.0     83.2
Coverage                  99.89%     99.83%     99.60%     100.00%    99.85%
Evaluation
Large misassemblies       0          0          2          0          0
Segment maps              99.20%     99.27%     95.00%     99.68%     99.18%
Performance
Execution time (min)      12         215        8          21         227
Memory usage (GB)         1.3        15.4       2          2.3        29.7

S. pombe
                          200bp + 10kbp                    200bp + 1kbp + 10kbp
                          PA         Allpaths2  Velvet     PA         Allpaths2
Contig statistics
Contigs (>200bp)          53         190        193        31         164
Average length (kb)       231.8      65.2       63.7       394.7      75.3
Maximum length (kb)       3500.9     868.4      1062.8     3519.6     851.0
Contig N50 size (kb)      1499.4     236.6      602.2      1487.7     226.8
Contig N90 size (kb)      210.0      65.7       148.6      507.6      76.4
Coverage                  97.56%     98.62%     98.95%     97.78%     98.60%
Evaluation
Large misassemblies       0          1          1          0          0
Segment maps              96.02%     96.78%     93.38%     96.42%     96.83%
Performance
Execution time (min)      76         853        26         101        734
Memory usage (GB)         3.8        45         5.3        4.5        66

HG18 chr10
                          200bp + 1kbp + 10kbp
                          PA
Contig statistics
Contigs (>200bp)          3158
Average length (kb)       39.1
Maximum length (kb)       514.5
Contig N50 size (kb)      89.0
Contig N90 size (kb)      24.3
Coverage                  94.20%
Evaluation
Large misassemblies       1
Segment maps              90.48%
Performance
Execution time (min)      1682
Memory usage (GB)         16

Thank you for your attention.
