Indel Mappers

The programs

• Pindel – A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads
– Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler and Zemin Ning

• Stampy – A statistical algorithm for sensitive and fast mapping of Illumina sequence reads
– Gerton Lunter and Martin Goodson (Gen Res Oct 2010)

• Last – Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection
– Michiaki Hamada, Edward Wijaya, Martin C. Frith and Kiyoshi Asai (Bioinformatics Oct 2011)

Flow of Pindel

• Aim: Compute precise breakpoints, as well as the fragments inserted or deleted relative to the reference, from paired-end reads

1. Use SSAHA2 to map all reads to the reference
– If both ends are uniquely mapped, keep them
– If one end is uniquely mapped (no mismatch allowed for this anchoring end)
• Other end must be mapped with a threshold of at least 20 (alignment score for ~36bp read)

Finding the unmapped end

• Given a unique anchor of one end, find the locus of its unmapped mate and its fragments
– 2 fragments if it is a deletion
– 3 fragments if it is an insertion

• Due to a deletion (must be supported by >=2 reads)
– User specifies the maximum deletion size, Min_F & Min_C

• Due to an insertion (<=20bp for 36bp reads)
– Insertion must be supported by >=2 reads
– Compute min & max unique substrings (US) of both 5’ & 3’ ends of the unmapped read
– Check if minUS_5’ is adjacent to maxUS_3’ and vice versa
– The region between minUS_5’ and minUS_3’ is the inserted fragment
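
The unique-substring idea can be illustrated with a minimal sketch. It assumes a short reference window searched by plain string matching, which is not how Pindel's pattern growth search is implemented; the function names `count_occurrences`, `unique_prefix_lengths` and `find_deletion` are hypothetical and chosen only for this example.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def unique_prefix_lengths(window, read):
    """Return (min_len, max_len) of the read prefixes found exactly once in window.

    Returns None when no prefix of the read is unique in the window.
    """
    unique = [n for n in range(1, len(read) + 1)
              if count_occurrences(window, read[:n]) == 1]
    return (unique[0], unique[-1]) if unique else None


def find_deletion(window, read):
    """Split the read into two fragments that each match the window uniquely.

    Returns the (start, end) window coordinates of the reference bases skipped
    between the two fragments, or None if no such split exists.
    """
    for split in range(1, len(read)):
        left, right = read[:split], read[split:]
        if (count_occurrences(window, left) == 1
                and count_occurrences(window, right) == 1):
            left_end = window.find(left) + len(left)
            right_start = window.find(right)
            if right_start > left_end:
                return left_end, right_start
    return None


# Toy data (made up): the read spans a 5 bp deletion of "TTTTT".
window = "GATTACAGGC" + "TTTTT" + "CCATGGTCAA"
read = "ACAGGC" + "CCATGG"
print(unique_prefix_lengths(window, read))
print(find_deletion(window, read))
```

In this toy case the read is explained by two fragments, so it is classified as a deletion; an insertion would leave a third, unmatched fragment in the middle of the read.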

Outline of Stampy

• Scanning the read
• Phred scores
• Similarity Filtering
• Single End - Mapping Posterior
• Paired-end reads: paired-end candidates

Scanning the read

• Overlapping 15mers considered
– Including 1-mismatch ‘neighbours’

• For reads >34bp and <50bp long
– 1-mismatch neighbours are considered for half of the 15mers
– For reads >=50bp long, only a third of the 15mers are considered
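
A minimal sketch of the scanning step, assuming reads over the alphabet A, C, G, T. The names `overlapping_kmers`, `one_mismatch_neighbours`, `neighbour_step` and `seeds` are hypothetical. Two points are assumptions rather than statements from the slides: that reads of 34bp or less get neighbours for every 15mer, and that "half" and "a third" mean every second and every third 15mer.

```python
K = 15
BASES = "ACGT"


def overlapping_kmers(read, k=K):
    """Return (offset, kmer) for every overlapping k-mer in the read."""
    return [(i, read[i:i + k]) for i in range(len(read) - k + 1)]


def one_mismatch_neighbours(kmer):
    """Return every sequence that differs from kmer at exactly one position."""
    return [kmer[:pos] + alt + kmer[pos + 1:]
            for pos, base in enumerate(kmer)
            for alt in BASES if alt != base]


def neighbour_step(read_length):
    """Neighbours are generated for every step-th 15mer (assumed selection rule)."""
    if read_length >= 50:
        return 3
    if read_length > 34:
        return 2
    return 1


def seeds(read):
    """Return (offset, sequence) lookups: each exact 15mer plus selected neighbours."""
    step = neighbour_step(len(read))
    lookups = []
    for offset, kmer in overlapping_kmers(read):
        lookups.append((offset, kmer))
        if offset % step == 0:
            lookups.extend((offset, n) for n in one_mismatch_neighbours(kmer))
    return lookups
```

A 15mer has 15 × 3 = 45 one-mismatch neighbours, so thinning the neighbour lookups is what keeps the number of hash queries manageable for longer reads.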

Phred scores

• Corresponding positions of the read are marked with a Phred score
– 0, if the 15mer is repetitive (>200 occurrences in the reference); for a 1-neighbour, the position is marked with the Phred quality of the mutated base
– All positions of non-repetitive 15mers are retrieved
– The scores are used to calculate the mapping posterior later

Similarity Filtering

• Three 4-nt words close to, but not overlapping, the 15mer are chosen
– Counts of A-C-G-T for these 12 read positions
– Counts of A-C-G-T for these 12 positions at the putative genomic location
– Get the absolute difference between the two sets of counts (read and reference); score T
– Candidate positions exceeding T will be discarded
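
A minimal sketch of this filter, assuming the 12 read bases and the 12 reference bases have already been extracted as strings. Summing the absolute differences over the four bases and treating T as a cut-off passed in by the caller are both assumptions about how the score is formed; `base_counts`, `count_difference` and `passes_filter` are hypothetical names.

```python
from collections import Counter


def base_counts(seq):
    """Return the counts of A, C, G and T in seq, in that order."""
    counts = Counter(seq)
    return [counts[base] for base in "ACGT"]


def count_difference(read_bases, ref_bases):
    """Sum of absolute differences between the two sets of base counts."""
    return sum(abs(r - g)
               for r, g in zip(base_counts(read_bases), base_counts(ref_bases)))


def passes_filter(read_bases, ref_bases, threshold):
    """Keep a candidate position unless its count difference exceeds the threshold."""
    return count_difference(read_bases, ref_bases) <= threshold


# One substitution (T -> A) changes two counts by one each.
print(count_difference("ACGTACGTACGT", "ACGTACGTACGA"))  # 2
```

Because only base composition is compared, the check is cheap and tolerant of small shifts, which is why it can be run before any full alignment of the candidate.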

Single End - Mapping Posterior

• Probability that a mapping is incorrect

1 − P(read | Lopt) / Σ P(read | Li)

• Lopt is the maximum-likelihood mapping location
• The sum runs over all considered locations Li
• This is only an approximation, as the correct location may not be among the considered Li

1. Read contains highly repetitive 15mers
2. Read is of low quality or highly diverged from the reference
3. Sequence is not represented in the reference

• Final mapping Phred score is obtained by summing 1, 2 and 3
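
A minimal sketch of the formula above, assuming the likelihoods P(read | Li) of the considered locations are already available as a list. The conversion to a Phred scale uses the standard −10·log10 definition; the correction terms for cases 1 to 3 are not modelled here, and the function names are hypothetical.

```python
import math


def mapping_error_probability(likelihoods):
    """Return 1 - P(read | Lopt) / sum_i P(read | Li) over the considered locations."""
    total = sum(likelihoods)
    return 1.0 - max(likelihoods) / total


def phred(p_error):
    """Convert an error probability to a Phred-scaled score."""
    if p_error <= 0.0:
        return math.inf
    return -10.0 * math.log10(p_error)


# Made-up likelihoods: one dominant location and two weak alternatives.
p_err = mapping_error_probability([9.0, 0.5, 0.5])
print(p_err, phred(p_err))
```

With a single candidate location the error probability is 0 under this formula, which is exactly the situation where the missing-location corrections matter most.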

Paired-end reads: paired-end candidates

• Pair is unmapped if no candidates are found for both reads
• Report the pair coordinates if
– Best locations for the pair are within 4 sd of the mean insert length
OR
– Phred score >=2 in (1 & 2)
• Else
– Candidates which constitute 99.9% of the posterior mapping score of the single read are extracted
• Its mate will be mapped against the reference region implied by the insert-length distribution

Paired-end reads: paired-end candidates

• Final mapping quality
– Product of the top-scoring single-end hits selected as the pair
– Or single-end posterior score of the anchoring hit

LAST

• Uses probabilistic alignment instead of maximum score-based alignment
– Based on a posterior decoding technique that uses marginal probabilities incorporating all possible alignments with quality scores

Outline of LAST

• Incorporating quality scores into a score matrix

• Probabilistic model for alignment
• Marginal Probabilities
• Probabilistic alignments with quality scores
– Y-centroid alignment
– LAMA alignment

Incorporating quality scores into a score matrix

• Old method: Sa,b is the substitution score for aligning reference nucleotide a with read nucleotide b

• Incorporate the quality score, q, into S

– T is a scaling factor

Probabilistic model for alignment

• Let S(A) be the score for alignment A.
• For a local alignment A, the probability of A
– x is the genome region
– y is the read, whose bases carry quality scores
– S(A) is computed from the ‘new’ substitution score matrix
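
A minimal sketch of how scores can be turned into alignment probabilities, assuming the probability of an alignment is proportional to exp(S(A)/T), with T the scaling factor of the score matrix. That functional form is an assumption made for this illustration. The sketch also normalises over an explicit list of candidate scores, whereas a full model would sum over all possible alignments with a forward-style dynamic programme. The name `alignment_probabilities` is hypothetical.

```python
import math


def alignment_probabilities(scores, t):
    """Normalise exp(S/T) over a list of candidate alignment scores.

    The maximum score is subtracted before exponentiating for numerical
    stability; this does not change the normalised result.
    """
    top = max(scores)
    weights = [math.exp((s - top) / t) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


# Made-up scores for three candidate alignments of the same read.
print(alignment_probabilities([40.0, 38.0, 20.0], t=2.0))
```

A larger T flattens the distribution across candidates, while a smaller T concentrates almost all probability on the top-scoring alignment.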

Marginal Probabilities

• Pik is the marginal probability that a base xi (i-th base of x) aligns with a base yk (k-th base of y)

• qi is the marginal probability that a base xi aligns with a gap

• Ui is the marginal probability that xi belongs to an un-aligned region that is not contained in the local alignment

Probabilistic alignments with quality scores

• Two methods that take quality scores into account by using the marginal probabilities
1. Y-centroid alignment
2. LAMA alignment

Y-centroid alignment

• Maximizing S(A) for alignment A
– Y is an input parameter

• xi~yk is an aligned column (without gaps) in A
– Computed with the Needleman-Wunsch (NW) algorithm

Parameter Y

• Adjusts the sensitivity and precision of the aligned columns
– When Y is low, LAST is conservative and only aligns bases with high probabilities
– When Y is high, more bases are aligned, at the cost of more false positives

• Drawback of Y-centroid
– Even with a low Y, the alignment from LAST may still contain many gaps
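
A minimal sketch of how Y trades sensitivity for precision, assuming the marginal probabilities are given as a matrix `p[i][k]` and that each aligned column xi~yk contributes (Y + 1)·p[i][k] − 1 to the objective while gaps contribute 0. That gain function is an assumption about the centroid objective, under which only columns with probability above 1/(Y + 1) can be worth aligning. The function name `y_centroid_alignment` and the toy matrix are hypothetical.

```python
def y_centroid_alignment(p, y_param):
    """NW-style dynamic programme over a matrix of marginal probabilities.

    Returns the aligned (i, k) index pairs that maximise the summed gain
    (y_param + 1) * p[i][k] - 1; unaligned positions contribute nothing.
    """
    n = len(p)
    m = len(p[0]) if n else 0
    # best[i][k]: best total gain using the first i bases of x and first k of y.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            gain = (y_param + 1.0) * p[i - 1][k - 1] - 1.0
            best[i][k] = max(best[i - 1][k], best[i][k - 1],
                             best[i - 1][k - 1] + gain)
    # Trace back from the corner, collecting the columns that were aligned.
    pairs = []
    i, k = n, m
    while i > 0 and k > 0:
        gain = (y_param + 1.0) * p[i - 1][k - 1] - 1.0
        if gain > 0 and best[i][k] == best[i - 1][k - 1] + gain:
            pairs.append((i - 1, k - 1))
            i, k = i - 1, k - 1
        elif best[i][k] == best[i - 1][k]:
            i -= 1
        else:
            k -= 1
    pairs.reverse()
    return pairs


# Toy marginals (made up): one confident column and one uncertain column.
p = [[0.9, 0.05],
     [0.05, 0.4]]
print(y_centroid_alignment(p, 1))  # [(0, 0)]
print(y_centroid_alignment(p, 3))  # [(0, 0), (1, 1)]
```

With Y = 1 the cut-off is 0.5, so only the confident column is aligned; with Y = 3 the cut-off drops to 0.25 and the uncertain column is aligned as well.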

LAMA alignment

• Considers the aligned columns and the gaps explicitly
• For the gaps
– Deletion in alignment
– Insertion in alignment
