FuzzyPath - A Hybrid De novo Assembler using Solexa and 454 Short Reads
Zemin Ning
The Wellcome Trust Sanger Institute
Outline of the Talk:
Assembly strategy
Read extension using base qualities and read pairs
Repeat junctions and single base variation
Fuzzy kmers - how to find mismatches
Assemblies with mixed Solexa and 454 reads
Solexa reads guided by a closely related reference
Long Solexa reads with 70 bp
Future work
Assembly Strategy
Solexa reads assembler to extend long reads of 1-2 Kb
Genome/Chromosome
Capillary reads assembler: Phrap/Phusion
Forward-reverse paired reads: 30-70 bp each, known distance ~500 bp
Kmer Extension & Repeat Junctions
Quality Filters on Junctions
Repetitive Contig and Read Pairs
Depth
For each hit read in the contig, contig index and offset are stored.
Insert length
Current read position
Contig start
Pair read position
Depth
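The bookkeeping described on this slide can be sketched as follows. This is only an illustration, not FuzzyPath's actual data structure: the names `Hit` and `pair_supports_placement`, the read identifiers, and the `tolerance` parameter are all hypothetical. The sketch assumes that each placed read records its contig index and offset, and that a read pair is consistent when both mates fall on the same contig roughly one insert length apart.

```python
from typing import Dict, NamedTuple


class Hit(NamedTuple):
    """Placement of one read: which contig it hit and at what offset."""
    contig_index: int
    offset: int


def pair_supports_placement(hits: Dict[str, Hit], read_id: str, mate_id: str,
                            insert_length: int, tolerance: int) -> bool:
    """Return True if both mates are placed on the same contig and their
    offsets differ by the insert length, within the given tolerance."""
    read_hit = hits.get(read_id)
    mate_hit = hits.get(mate_id)
    if read_hit is None or mate_hit is None:
        return False  # one of the mates has not been placed yet
    if read_hit.contig_index != mate_hit.contig_index:
        return False  # mates landed on different contigs
    distance = abs(read_hit.offset - mate_hit.offset)
    return abs(distance - insert_length) <= tolerance


# Hypothetical placements: pair r1 is consistent, pair r2 spans two contigs.
hits = {
    "r1/1": Hit(contig_index=0, offset=1000),
    "r1/2": Hit(contig_index=0, offset=1490),
    "r2/1": Hit(contig_index=0, offset=1000),
    "r2/2": Hit(contig_index=3, offset=200),
}
print(pair_supports_placement(hits, "r1/1", "r1/2", insert_length=500, tolerance=50))
print(pair_supports_placement(hits, "r2/1", "r2/2", insert_length=500, tolerance=50))
```

In a repetitive contig, a check of this kind is one way the stored pair positions could help decide which copy of a repeat a read belongs to.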
Handling of Single Base Variations
ACGTAACT A ACAGTT    00 01 10 11 00 00 01 11  00  00 01 00 10 11 11
ACGTAACT C ACAGTT    00 01 10 11 00 00 01 11  01  00 01 00 10 11 11
ACGTAACT   ACAGTT    00 00 00 00 00 00 00 00  01  00 00 00 00 00 00
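The three rows above can be reproduced with a short sketch. It assumes the 2-bit codes visible on the slide (A=00, C=01, G=10, T=11) and reads the third row as the bitwise XOR of the first two, which is an interpretation rather than something the slide states. The function names are illustrative only.

```python
from typing import List

# 2-bit base codes as they appear on the slide.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer: str) -> int:
    """Pack a kmer into an integer, two bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def variant_positions(a: str, b: str) -> List[int]:
    """Return the 0-based positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    k = len(a)
    diff = encode_kmer(a) ^ encode_kmer(b)
    positions = []
    for i in range(k):
        shift = 2 * (k - 1 - i)      # bit offset of base i
        if (diff >> shift) & 0b11:   # any set bit means the bases differ
            positions.append(i)
    return positions


kmer_a = "ACGTAACT" + "A" + "ACAGTT"
kmer_c = "ACGTAACT" + "C" + "ACAGTT"
print(variant_positions(kmer_a, kmer_c))
```

The XOR is zero everywhere except at the single variant base, so the position of the variation falls out of one integer operation.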
Number of Mismatches between Two Kmers
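One way to count mismatches between two 2-bit encoded kmers is sketched below. It is not taken from the slides: the XOR-and-collapse trick is a common technique swapped in for illustration, and `count_mismatches`, `is_fuzzy_match` and `max_mismatches` are hypothetical names. The encoding assumed is A=00, C=01, G=10, T=11.

```python
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer: str) -> int:
    """Pack a kmer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def count_mismatches(a: str, b: str) -> int:
    """Count the bases that differ between two equal-length kmers."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    k = len(a)
    diff = encode_kmer(a) ^ encode_kmer(b)
    low_bits = int("01" * k, 2) if k else 0   # mask 0101...01, one bit per base
    # Fold the high bit of each base onto its low bit, then keep low bits only,
    # so every differing base contributes exactly one set bit.
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


def is_fuzzy_match(a: str, b: str, max_mismatches: int) -> bool:
    """Treat two kmers as the same 'fuzzy' kmer if they differ at no more
    than max_mismatches positions (threshold chosen by the caller)."""
    return count_mismatches(a, b) <= max_mismatches


print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))
print(is_fuzzy_match("ACGTAACTAACAGTT", "ACGTAACTCACAGTT", max_mismatches=1))
```

Counting per base rather than per bit matters here: A versus T differs in two bits but is still a single mismatch.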
Use of Kmers with Mismatches
Mixed Solexa and 454 Reads
L = ~250 bp
L-K+1 kmers
L-N-K+1 kmers
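The L-K+1 count can be checked with a small sketch. The kmer size K is not given on the slide, so K = 31 below is an assumption chosen only for illustration. N is also not defined in the slide text, so the L-N-K+1 case is not modelled here.

```python
from typing import List


def kmers(read: str, k: int) -> List[str]:
    """Return every overlapping kmer of a read; a read of length L yields
    L - K + 1 kmers, or none if the read is shorter than K."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Small worked example: L = 5, K = 3 gives 5 - 3 + 1 = 3 kmers.
print(kmers("ACGTA", 3))

# A 250 bp read (the 454 read length on the slide) with an assumed K of 31.
read_length, k = 250, 31
print(len(kmers("A" * read_length, k)), read_length - k + 1)
```

The same function applies unchanged to the 3000 bp shredded reads and the 70 bp Solexa reads discussed on later slides; only L changes.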
Pileup of 454 reads at a repeat junction
Pileup of Solexa and 454 Reads
Guided by A Closely Related Reference
L = 3000 bp
L-K+1 kmers
L-N-K+1 kmers
Pileup of shredded reads at a repeat junction
Pileup of Solexa and Shredded Reads
Long Solexa Reads with 70 bp
L = 70 bp
L-K+1 kmers
Pileup of long Solexa reads at a repeat junction
Pileup of Long 70 bp Solexa Reads
S. suis P1/7 Solexa/454 Assembly
Solexa reads:
  Number of reads: 3,084,185
  Finished genome size: 2,007,491 bp
  Read length: 39 and 36 bp
  Estimated read coverage: ~55X
  Number of 454 reads: 100,000
  Read coverage of 454: 10X
Assembly features - contig stats:
  Total number of contigs: 73
  Total bases of contigs: 1,999,817 bp
  N50 contig size: 62,508
  Largest contig: 162,190
  Averaged contig size: 27,394
  Contig coverage over the genome: ~99 %
  Contig extension errors: 2
  Mis-assembly errors: 3
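For readers unfamiliar with the contig statistics quoted here, the sketch below shows the usual definition of N50 and the averaged contig size. The contig lengths in the toy example are made up, since the slides do not list individual contig lengths; only the total bases and contig count are taken from the slide.

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Length of the contig at which the running total of contigs, sorted
    longest first, reaches at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Toy contig set (hypothetical lengths), total 290 bases.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))

# Averaged contig size from the slide's totals: total bases // number of contigs.
print(1_999_817 // 73)
```

Floor division of the reported total bases by the reported contig count reproduces the averaged contig size of 27,394 given on the slide.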
S. suis P1/7 with Shredded Paired-end Reads
Shredded reads:
  Number of reads: 1,338,161
  Finished genome size: 2,007,491 bp
  Read length: 36
  Estimated read coverage: 24X
  Insert size: 500 bp
Assembly features:         Paired_Data   Not_Paired
Number of contigs:         35            317
Total assembled bases:     1.996 Mb      1.956 Mb
N50 contig size:           243,039       13,929
Largest contig:            474,070       33,460
Averaged contig size:      57,043        6,168
Contig coverage:           >99.0 %       >99.0 %
Contig extension errors:   0             0
Mis-assembly errors:       3             2
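The estimated read coverage on this slide follows from the standard formula: number of reads times read length, divided by genome size. The sketch below applies it to the shredded-read figures from the slide; the function name is illustrative.

```python
def read_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Shredded S. suis reads: 1,338,161 reads of 36 bp over a 2,007,491 bp genome.
shredded = read_coverage(1_338_161, 36, 2_007_491)
print(f"{shredded:.1f}X")
```

The result rounds to the 24X quoted on the slide.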
Salmonella delhi5 Solexa Assembly Guided by A Close Reference
Solexa reads:
  Number of reads: 6,346,317
  Finished genome size: 4.7 Mbp
  Read length: 33 bp
  Estimated read coverage: ~40X
  Shredded reference of SpA: 10X
Assembly features - contig stats:
  Total number of contigs: 66
  Total bases of contigs: 4,615,704 bp
  N50 contig size: 168,793
  Largest contig: 401,700
  Averaged contig size: 69,934
  Contig coverage over the genome: ~98 %
  Contig extension errors: 0
  Mis-assembly errors: 2
Acknowledgements:
Yong Gu, Ben Blackburne, Hannes Ponstingl, Harold Swerdlow, Michael Quail, Tony Cox, Richard Durbin