Indel Mappers

Indel Mapper

• Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads
  – Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler and Zemin Ning

The programs

• Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads
  – Gerton Lunter and Martin Goodson (Genome Research, Oct 2010)
• LAST: Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection
  – Michiaki Hamada, Edward Wijaya, Martin C. Frith and Kiyoshi Asai (Bioinformatics, Oct 2011)

Flow of Pindel

• Aim: compute precise breakpoints, as well as the fragments inserted or deleted relative to the reference, from paired-end reads

1. Use SSAHA2 to map all reads to the reference
  – If both ends are uniquely mapped, keep them
  – If one end is uniquely mapped (no mismatch allowed for this anchoring end)
    • The other end must be mapped with a threshold of at least 20 (alignment score for a ~36bp read)

Finding the unmapped end

• Given a unique anchor of one end, find the locus of its unmapped pair and its fragments
  – 2 fragments if it is a deletion
  – 3 fragments if it is an insertion

Finding the unmapped end

• Due to a deletion (must be supported by >=2 reads)
  – The user specifies Max. delete size, Min_F & Min_C
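The two-fragment (deletion) case can be sketched as a split-read search: cut the read into a left and a right fragment, place the left fragment on the reference, and look for the right fragment a short distance downstream. This is a simplified illustration with exact matching on a single read, not Pindel's implementation; the name `find_deletion` and the parameters `max_del` and `min_frag` are hypothetical, and the requirement of >=2 supporting reads is not modelled.

```python
def find_deletion(read, ref, max_del, min_frag=4):
    """Return (breakpoint, deleted_length) or None.

    breakpoint is the reference index of the first deleted base.
    Both fragments must match the reference exactly (an assumption
    made to keep the sketch short).
    """
    n = len(read)
    for split in range(min_frag, n - min_frag + 1):
        left, right = read[:split], read[split:]
        pos = ref.find(left)
        while pos != -1:
            # Try every allowed gap between the two fragments.
            for gap in range(1, max_del + 1):
                start = pos + split + gap
                if ref[start:start + len(right)] == right:
                    return (pos + split, gap)
            pos = ref.find(left, pos + 1)
    return None


ref = "GATTACAGGCTTAACCGTGCATCGGATCAAGT"
# Read spanning a 5bp deletion of ref[12:17] ("AACCG").
read = ref[4:12] + ref[17:25]
result = find_deletion(read, ref, max_del=10)
```

Here `result` holds the breakpoint position and the length of the deleted fragment; a read that matches the reference contiguously returns `None`.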

Finding the unmapped end

• Due to an insertion (<=20bp for 36bp reads)
  – Insertion must be supported by >=2 reads
  – Compute min & max unique substrings (US) of both 5’ & 3’ ends of the unmapped read
  – Check if minUS_5’ is adjacent to maxUS_3’ and vice versa
  – The region between minUS_5’ and minUS_3’ is the inserted fragment

Outline of Stampy

• Scanning the read
• Phred scores
• Similarity Filtering
• Single End - Mapping Posterior
• Paired-end reads: paired-end candidates

Scanning the read

• Overlapping 15mers considered
  – Including 1-mismatch ‘neighbours’
• For reads >34bp and <50bp long
  – 1-mismatch neighbours are considered for half of the 15mers
• For reads >=50bp long
  – 1-mismatch neighbours are considered for only a third of the 15mers
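A minimal sketch of the scanning step, assuming that "half" and "a third" of the 15mers mean every second and every third 15mer, and that shorter reads get neighbours for every 15mer (both are assumptions, as are all function names):

```python
def kmers(read, k=15):
    """All overlapping k-mers of the read with their start offsets."""
    return [(i, read[i:i + k]) for i in range(len(read) - k + 1)]


def one_mismatch_neighbours(kmer, alphabet="ACGT"):
    """All sequences that differ from kmer at exactly one position."""
    out = []
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                out.append(kmer[:i] + alt + kmer[i + 1:])
    return out


def neighbour_stride(read_length):
    """Assumed rule: neighbours for every 15mer, every 2nd, or every 3rd."""
    if read_length >= 50:
        return 3
    if read_length > 34:
        return 2
    return 1


def seeds(read, k=15):
    """15mers of the read plus 1-mismatch neighbours for a subset of them."""
    stride = neighbour_stride(len(read))
    out = []
    for i, kmer in kmers(read, k):
        out.append((i, kmer))
        if i % stride == 0:
            out.extend((i, nb) for nb in one_mismatch_neighbours(kmer))
    return out


read = "ACGT" * 9  # a 36bp read
all_seeds = seeds(read)
```

A 15mer has 15 x 3 = 45 one-mismatch neighbours, which is why thinning them out for longer reads keeps the number of lookups manageable.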

Phred scores

• Corresponding positions of the read are marked with a Phred score
  – 0, if it is repetitive (>200 occurrences in the reference); for its 1-neighbour, it is marked by the Phred quality of the mutated base
  – All positions of non-repetitive 15mers are retrieved
  – The scores are used to calculate the mapping posterior later

Similarity Filtering

• Three 4-nt words close to but non-overlapping with the 15mer are chosen
  – Counts of A-C-G-T for these 12 read-positions
  – Counts of A-C-G-T for these 12 positions at the putative genomic location
  – Get the absolute difference between the two sets of counts (read and reference) and compare it with a threshold score T
  – Candidate positions exceeding T will be discarded
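The filter above amounts to comparing base compositions. A minimal sketch, assuming the difference is the sum of per-base absolute count differences (the function names and the example threshold are hypothetical):

```python
from collections import Counter


def base_count_difference(read_words, ref_words):
    """Sum over A, C, G, T of |count in read words - count in reference words|."""
    read_counts = Counter("".join(read_words))
    ref_counts = Counter("".join(ref_words))
    return sum(abs(read_counts[b] - ref_counts[b]) for b in "ACGT")


def passes_filter(read_words, ref_words, threshold):
    """Keep a candidate position only if the difference does not exceed T."""
    return base_count_difference(read_words, ref_words) <= threshold


read_words = ["ACGT", "AACC", "GGTT"]  # three 4-nt words from the read
ref_words = ["ACGT", "AACC", "GGTA"]   # same 12 positions at the candidate locus
diff = base_count_difference(read_words, ref_words)
```

Because only counts are compared, the check is cheap and tolerant of small shifts within the words, at the price of letting some dissimilar candidates through.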

Single End - Mapping Posterior

• Probability that a mapping is incorrect: $1 - P(\mathrm{read} \mid L_{opt}) / \sum_i P(\mathrm{read} \mid L_i)$
• Lopt is the maximum likelihood mapping location
• The sum runs over all considered locations
• This is only an approximation, as the correct location may not be among the considered Li:
  1. Read contains highly repetitive 15mers
  2. Low quality or highly diverged from reference
  3. Sequence is not represented in reference
• Final mapping Phred score is obtained by summing 1, 2, 3
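The posterior error estimate can be computed directly from the per-location likelihoods. The sketch below covers only the ratio over considered locations, not the three correction terms; the function names and the `cap` value are assumptions.

```python
import math


def mapping_error_probability(likelihoods):
    """1 - P(read | Lopt) / sum_i P(read | Li) over the considered locations."""
    total = sum(likelihoods)
    best = max(likelihoods)
    return 1.0 - best / total


def to_phred(p_error, cap=60.0):
    """Convert an error probability to a Phred-scaled score, capped at `cap`."""
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


likelihoods = [0.9, 0.05, 0.05]  # P(read | Li) for three candidate locations
p_wrong = mapping_error_probability(likelihoods)
quality = to_phred(p_wrong)
```

With one dominant location the error probability is small and the Phred score high; with several equally likely locations the score drops towards zero.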

Paired-end reads: paired-end candidates

• Pair is unmapped if no candidates are found for both reads
• Report the pair-coordinates if
  – Best locations for the pair are within 4 sd of the mean insert-length, OR
  – Phred score >=2 in (1 & 2)
• Else
  – Candidates which constitute 99.9% of the posterior mapping score of the single read are extracted
    • Its mate will be mapped against the reference region implied by the insert-length distribution
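Two pieces of this step lend themselves to a short sketch: the insert-length check and the extraction of candidates that cover 99.9% of the posterior. Both functions below are hypothetical helpers, and the input is assumed to be a mapping from candidate location to a normalised posterior probability.

```python
def within_insert_range(insert_length, mean, sd, n_sd=4):
    """True if the observed insert length lies within n_sd standard deviations."""
    return abs(insert_length - mean) <= n_sd * sd


def top_candidates(posteriors, mass=0.999):
    """Locations in decreasing posterior order until `mass` is covered."""
    chosen = []
    cumulative = 0.0
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    for location, prob in ranked:
        chosen.append(location)
        cumulative += prob
        if cumulative >= mass:
            break
    return chosen


posteriors = {100: 0.9, 2500: 0.0995, 7000: 0.0005}
anchors = top_candidates(posteriors)
```

Each location in `anchors` would then define a window, derived from the insert-length distribution, in which the mate is searched.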

Paired-end reads: paired-end candidates

• Final mapping quality
  – Product of the top-scoring single-end hits selected as the pair
  – Or single-end posterior score of the anchoring hit

LAST

• Uses probabilistic alignment instead of maximum score-based alignment
  – Based on a posterior decoding technique which uses marginal probabilities that incorporate all possible alignments with quality scores

Outline of LAST

• Incorporating quality scores into a score matrix
• Probabilistic model for alignment
• Marginal Probabilities
• Probabilistic alignments with quality scores
  – γ-centroid alignment
  – LAMA alignment

Incorporating quality scores into a score matrix

• Old Method: Sa,b is the substitution score of aligning nucleotide reference-a onto read-b

• Incorporate quality-score, q, into S

– T is a scaling factor
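One way to fold a quality score q into the substitution score is to average the likelihood ratio exp(S/T) over the possible true read bases, weighted by the miscall probability implied by q. The sketch below assumes a uniform miscall model (each wrong base equally likely) and uniform background base frequencies; it is an illustration under those assumptions and not necessarily the formula used by LAST. The function name and arguments are hypothetical.

```python
import math


def quality_aware_score(score, ref_base, read_base, phred_q, T=1.0, alphabet="ACGT"):
    """Substitution score adjusted for the read base's Phred quality.

    score maps (ref_base, read_base) to the ordinary score S_ab.
    """
    u = 10 ** (-phred_q / 10.0)  # probability that the base call is wrong
    ratio = 0.0
    for true_base in alphabet:
        if true_base == read_base:
            p_observed = 1.0 - u
        else:
            p_observed = u / (len(alphabet) - 1)
        ratio += p_observed * math.exp(score[(ref_base, true_base)] / T)
    return T * math.log(ratio)


# Simple +1 / -1 matrix for illustration only.
score = {(a, b): (1.0 if a == b else -1.0) for a in "ACGT" for b in "ACGT"}
high_q_match = quality_aware_score(score, "A", "A", phred_q=60)
low_q_match = quality_aware_score(score, "A", "A", phred_q=3)
low_q_mismatch = quality_aware_score(score, "A", "C", phred_q=3)
```

With a high quality the adjusted score approaches the ordinary S; with a low quality, matches are rewarded less and mismatches penalised less.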

Probabilistic model for alignment

• Let S(A) be the score for alignment A.
• For a local alignment A, the probability of A is defined from S(A), where
  – x is the genome region
  – y is the read-base with a quality score
  – S(A) is computed from the ‘new’ substitution score matrix
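A common way to turn alignment scores into probabilities is the Boltzmann form below, with T the same scaling factor as in the score matrix. This is an assumed form given for orientation, not a formula taken from the slide; Z(x, y) is a normalising sum over all local alignments A' of x and y.

```latex
P(A \mid x, y) = \frac{\exp\left(S(A)/T\right)}{Z(x, y)},
\qquad
Z(x, y) = \sum_{A'} \exp\left(S(A')/T\right)
```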

Marginal Probabilities

• Pik is the marginal probability that a base xi (i-th base of x) aligns with a base yk (k-th base of y)

• qi is the marginal probability that a base xi aligns with a gap

• Ui is the marginal probability that xi belongs to an un-aligned region that is not contained in the local alignment
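In any single alignment, a base xi is either aligned to some base yk, aligned to a gap, or outside the local alignment. Under that reading (an inference from the three definitions, not a formula stated on the slide), the marginals for each position should account for all of the probability mass:

```latex
\sum_{k} P_{ik} + q_i + U_i = 1 \quad \text{for every } i
```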

Probabilistic alignments with quality scores

• Two methods consider quality scores by using the marginal probabilities
  1. γ-centroid alignment
  2. LAMA alignment

γ-centroid alignment

• Maximizing S(A) for alignment A
  – γ is a parametric input
• xi~yk is an aligned column (without gaps) in A
  – Computed with the Needleman-Wunsch (NW) algorithm
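A centroid alignment of this kind can be sketched as a Needleman-Wunsch style dynamic program over the marginal probabilities. The sketch assumes the objective is the sum of (gamma + 1) * P_ik - 1 over aligned columns, the usual form of a gamma-centroid estimator; that objective, the function name, and the toy probabilities are assumptions, with `gamma` standing for the parametric input of the centroid alignment.

```python
def gamma_centroid_alignment(p, gamma):
    """Return (score, columns) maximising sum((gamma + 1) * p[i][k] - 1).

    p[i][k] is the marginal probability that x_i aligns with y_k.
    columns is an ordered list of (i, k) pairs; unlisted positions are gaps.
    """
    n = len(p)
    m = len(p[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            gain = (gamma + 1) * p[i - 1][k - 1] - 1
            best[i][k] = max(best[i - 1][k], best[i][k - 1],
                             best[i - 1][k - 1] + gain)
    # Trace back to recover the aligned columns.
    columns = []
    i, k = n, m
    while i > 0 and k > 0:
        gain = (gamma + 1) * p[i - 1][k - 1] - 1
        if gain > 0 and best[i][k] == best[i - 1][k - 1] + gain:
            columns.append((i - 1, k - 1))
            i, k = i - 1, k - 1
        elif best[i][k] == best[i - 1][k]:
            i -= 1
        else:
            k -= 1
    columns.reverse()
    return best[n][m], columns


p = [[0.9, 0.05],
     [0.1, 0.3]]
_, conservative = gamma_centroid_alignment(p, gamma=1.0)
_, permissive = gamma_centroid_alignment(p, gamma=4.0)
```

Under this objective a column contributes positively only when its probability exceeds 1 / (gamma + 1), so a small gamma keeps only high-confidence columns and a larger gamma admits weaker ones.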

Parameter γ

• Adjusts the sensitivity and precision of the aligned columns
  – When γ is low, LAST is conservative and only aligns bases with high probabilities
  – When γ is high, the rate of alignments increases at the cost of more false positives
• Drawback of the γ-centroid alignment
  – Even with a low γ, the alignment may still contain many gaps

LAMA alignment

• Consider the aligned columns and the gaps explicitly
• For the gaps

[Figure: deletion in alignment; insertion in alignment]