SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396
The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26: 38-45
Presented by: Xia Li
Short-read mapping software

Software    Technique                                         Reference
GNUMAP      Hashing refs + base quality + repeated regions    Clement et al., 2010
Novoalign   Hashing refs                                      Novocraft, unpublished
SOAP        Hashing refs                                      Li et al., 2008
SeqMap      Hashing reads                                     Jiang et al., 2008
RMAP        Hashing reads + read quality                      Smith et al., 2008
Eland       Hashing reads                                     Cox, unpublished
Bowtie      BWT                                               Langmead et al., 2009
Slider      Lexicographic sorting + base quality              Malhis et al., 2009
SeqMap
• Motivation
  – Hashing the genome usually needs a lot of memory (e.g. SOAP needs 14 GB of memory when mapping to the human genome)
  – Allow more substitutions and insertions/deletions
SeqMap
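• Pigeonhole principle (a read split into 4 parts with at most 2 substitutions has at least 2 parts that match exactly)
  – Spaced seed alignment
  – ELAND, SOAP, RMAP
• Hash reads
• Insertion/deletion: combinations of 2 of the 4 parts, with 1 of the 2 parts shifted one nucleotide to its left or right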
[Figure labels: Short Read; Split into 4 parts; All combinations of 2/4 parts; Short read look up table (indexed by 2 parts); Reference Genome]
Image credit: J. Ruan
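The pigeonhole idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SeqMap's actual code: all reads are assumed to have the same length, only substitutions are handled (the shifted-part variant for insertions/deletions is left out), and the names `split_parts`, `build_read_index` and `map_reads` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_parts(seq, n_parts=4):
    """Split seq into n_parts equal blocks (any leftover tail is ignored)."""
    size = len(seq) // n_parts
    return [seq[k * size:(k + 1) * size] for k in range(n_parts)]


def build_read_index(reads, n_parts=4, n_exact=2):
    """Hash every read under each combination of n_exact of its n_parts blocks."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        parts = split_parts(read, n_parts)
        for combo in combinations(range(n_parts), n_exact):
            key = (combo, tuple(parts[k] for k in combo))
            index[key].append(read_id)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(genome, reads, max_mismatches=2, n_parts=4):
    """Return sorted (read_id, position) hits with at most max_mismatches substitutions.

    Assumes equal-length reads and max_mismatches < n_parts.
    """
    read_len = len(reads[0])
    n_exact = n_parts - max_mismatches  # blocks guaranteed to match exactly
    index = build_read_index(reads, n_parts, n_exact)
    hits = set()
    # Slide a read-sized window along the genome and look up its block pairs.
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        parts = split_parts(window, n_parts)
        for combo in combinations(range(n_parts), n_exact):
            key = (combo, tuple(parts[k] for k in combo))
            for read_id in index.get(key, ()):
                # The seed only proposes candidates; verify the full window.
                if hamming(window, reads[read_id]) <= max_mismatches:
                    hits.add((read_id, pos))
    return sorted(hits)
```

With 4 parts and at most 2 substitutions, each read is stored under 6 keys (all choices of 2 parts out of 4), so the index grows with the number of reads rather than with the genome size.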
Experiment & Result
• Deal with more substitutions and insertions/deletions
  – Randomly generate a DNA sequence 1 Mb long, then add 100 Kb of random substitutions, N's and insertions/deletions
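A simulated benchmark of this kind could be generated roughly as follows. This is a minimal sketch, not the authors' procedure: the per-base event rate of 0.1 is taken from the 100 Kb / 1 Mb ratio on the slide, while the equal split between substitutions, N's, insertions and deletions and the function names `random_dna` and `perturb` are assumptions made for illustration.

```python
import random


def random_dna(length, rng):
    """Uniformly random DNA string of the given length."""
    return "".join(rng.choices("ACGT", k=length))


def perturb(seq, rate, rng):
    """Apply one random event to each base with probability `rate`.

    Events (chosen uniformly, an assumption): substitution, replacement
    by N, insertion of a random base after the position, or deletion.
    """
    out = []
    for base in seq:
        if rng.random() >= rate:
            out.append(base)  # base left untouched
            continue
        kind = rng.choice(("sub", "n", "ins", "del"))
        if kind == "sub":
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif kind == "n":
            out.append("N")
        elif kind == "ins":
            out.append(base)
            out.append(rng.choice("ACGT"))
        # "del": emit nothing for this base
    return "".join(out)
```

For example, `perturb(random_dna(1_000_000, rng), 0.1, rng)` with `rng = random.Random(0)` yields a reproducible mutated copy of a 1 Mb sequence from which test reads can be sampled.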
GNUMAP
• Motivation
  – Base uncertainty
    • Such as nearly equal or low probabilities for A, C, G or T
    • Filter low quality reads [RMAP] -> discard up to half of the reads (Harismendy et al., 2009)
  – Repeated regions in the genome
    • Discard them -> loss of up to half of the data (Harismendy et al., 2009)
    • Record one -> unequal mapping to some of the repeat regions
    • Record all -> each location having 3 times the correct score
GNUMAP
• Flow-chart
[Figure: GNUMAP flow chart, including the Probabilistic Needleman-Wunsch and Alignment Score steps]
Sequences shown in the figure: ACTGAACCATACGGGTACTGAACCATGAA; AACCAT; GGGTACAACCATTAC
Read from sequencer: GGGTACAACCAT
Read is added to both repeat regions proportionally to their match quality, weighted by its # of occurrences in the genome
Slide credit: N. Clement
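The two ideas on this slide, scoring a read whose bases are uncertain and sharing it between repeat regions, can be illustrated with a small Python sketch. It is a simplified reading of the slide, not GNUMAP's implementation: each read position is assumed to be a dictionary of base probabilities, the match/mismatch/gap values are arbitrary, the shares are made proportional to the exponentiated score (an assumption chosen here so that weights stay positive), and the weighting by number of occurrences in the genome is left out. The names `pnw_score` and `share_read` are hypothetical.

```python
import math


def pnw_score(read_probs, ref, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global (Needleman-Wunsch) alignment score of a probabilistic read.

    read_probs: list of dicts, one per read position, mapping base -> probability.
    ref: reference string. The substitution score at a cell is the expected
    match/mismatch score under the read's base probabilities.
    """
    m = len(ref)
    prev = [j * gap for j in range(m + 1)]
    for i, probs in enumerate(read_probs, start=1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            p_match = probs.get(ref[j - 1], 0.0)
            s = p_match * match + (1.0 - p_match) * mismatch
            cur[j] = max(prev[j - 1] + s,   # align read base to ref base
                         prev[j] + gap,     # gap in the reference
                         cur[j - 1] + gap)  # gap in the read
        prev = cur
    return prev[m]


def share_read(scores):
    """Split one read across candidate locations in proportion to exp(score)."""
    weights = {loc: math.exp(s) for loc, s in scores.items()}
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}


# Usage with the genome string from the slide and the repeated segment AACCAT.
genome = "ACTGAACCATACGGGTACTGAACCATGAA"
read = "AACCAT"
read_probs = [{b: 1.0} for b in read]  # fully confident base calls
sites = [i for i in range(len(genome) - len(read) + 1)
         if genome[i:i + len(read)] == read]
scores = {i: pnw_score(read_probs, genome[i:i + len(read)]) for i in sites}
shares = share_read(scores)
print(sites, shares)
```

Because both copies of the repeat score identically here, the read contributes half of its weight to each location instead of being discarded or counted in full at both.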
Experiment & Result
Comments
• SeqMap
  – Pros: handles more substitutions and insertions/deletions
  – Cons: memory consuming, not fast
• GNUMAP
  – Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
  – Cons: memory consuming, slow, more noise