Indel Mappers

The programs

• Pindel – A Pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads
  – Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler and Zemin Ning

• Stampy – A statistical algorithm for sensitive and fast mapping of Illumina sequence reads
  – Gerton Lunter and Martin Goodson (Genome Research, Oct 2010)

• LAST – Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection
  – Michiaki Hamada, Edward Wijaya, Martin C. Frith and Kiyoshi Asai (Bioinformatics, Oct 2011)

Flow of Pindel

• Aim: compute precise breakpoints, as well as the fragments inserted or deleted relative to the reference, from paired-end reads

1. Use SSAHA2 to map all reads to the reference
  – If both ends are uniquely mapped, keep them
  – If only one end is uniquely mapped (no mismatch allowed for this anchoring end)
    • The other end must be mapped with a threshold of at least 20 (alignment score for ~36 bp reads)

Finding the unmapped end

• Given a unique anchor at one end, find the locus of its unmapped mate and its fragments
  – 2 fragments if it is a deletion
  – 3 fragments if it is an insertion

• Due to a deletion (must be supported by >=2 reads)
  – User specifies max. deletion size, Min_F & Min_C

• Due to an insertion (<=20 bp for 36 bp reads)
  – Insertion must be supported by >=2 reads
  – Compute min & max unique substrings (US) of both the 5’ & 3’ ends of the unmapped read
  – Check if minUS_5’ is adjacent to maxUS_3’ and vice versa
  – The region between minUS_5’ and minUS_3’ is the inserted fragment
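The "2 fragments if it is a deletion" idea can be illustrated with a small sketch. This is a brute-force stand-in, not Pindel's pattern-growth search: it assumes exact matching, and the function names (`split_read_deletion`, `supported_deletions`) and the `max_del` and `min_support` parameters are hypothetical. The read is split into a prefix that matches the reference and a suffix that matches again a few bases further downstream; a deletion is kept only when at least two reads support it.

```python
from collections import Counter

def split_read_deletion(ref, read, max_del):
    """Return (deletion_start, deletion_length) in reference coordinates, or None.

    Brute-force illustration only (assumes exact matches, no sequencing errors).
    """
    n = len(read)
    for start in range(len(ref) - n + 1):
        # Length of the longest read prefix that matches the reference at `start`.
        prefix = 0
        while prefix < n and ref[start + prefix] == read[prefix]:
            prefix += 1
        if prefix == 0 or prefix == n:
            continue  # no anchor here, or the read matches without any gap
        suffix = read[prefix:]
        # Second fragment: the suffix must match after skipping d reference bases.
        for d in range(1, max_del + 1):
            pos = start + prefix + d
            if ref[pos:pos + len(suffix)] == suffix:
                return (start + prefix, d)
    return None

def supported_deletions(ref, reads, max_del, min_support=2):
    """Keep deletions reported by at least `min_support` reads."""
    calls = Counter()
    for read in reads:
        call = split_read_deletion(ref, read, max_del)
        if call is not None:
            calls[call] += 1
    return {call: n for call, n in calls.items() if n >= min_support}
```

In practice the search would be restricted to the region implied by the anchored mate and the maximum deletion size, rather than scanning the whole reference as this sketch does.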

Outline of Stampy

• Scanning the read
• Phred scores
• Similarity Filtering
• Single End - Mapping Posterior
• Paired-end reads: paired-end candidates

Scanning the read

• Overlapping 15mers considered
  – Including 1-mismatch ‘neighbours’

• For reads >34 bp and <50 bp long
  – 1-mismatch neighbours are considered for half of the 15mers
• For reads >=50 bp long
  – 1-mismatch neighbours are considered for only a third of the 15mers
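A minimal sketch of the scanning step, under stated assumptions: "half" and "a third" of the 15mers are interpreted here as every 2nd and every 3rd 15mer, and reads of 34 bp or shorter get neighbours for every 15mer. Neither detail is given on the slide, and the function names are hypothetical.

```python
def overlapping_kmers(read, k=15):
    """All overlapping k-mers of the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def one_mismatch_neighbours(kmer, alphabet="ACGT"):
    """Yield every sequence that differs from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def neighbour_step(read_len):
    """Assumed rule: neighbours for every 15mer, every 2nd, or every 3rd."""
    if read_len >= 50:
        return 3
    if read_len > 34:
        return 2
    return 1

def seeds(read, k=15):
    """15mers of the read plus the 1-mismatch neighbours of the selected ones."""
    step = neighbour_step(len(read))
    out = []
    for idx, kmer in enumerate(overlapping_kmers(read, k)):
        out.append(kmer)
        if idx % step == 0:
            out.extend(one_mismatch_neighbours(kmer))
    return out
```

Each 15mer has 3 x 15 = 45 one-mismatch neighbours, which is why longer reads, with more 15mers to draw on, can afford to expand only a fraction of them.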

Phred scores

• Corresponding positions of the read are marked with a Phred score
  – 0 if the 15mer is repetitive (>200 occurrences in the reference); for a 1-neighbour, the position is marked with the Phred quality of the mutated base
  – All positions of non-repetitive 15mers are retrieved
  – The scores are used to calculate the mapping posterior later

Similarity Filtering

• Three 4-nt words close to, but not overlapping, the 15mer are chosen
  – Count A, C, G, T over these 12 read positions
  – Count A, C, G, T over the same 12 positions at the putative genomic location
  – Take the absolute difference between the two sets of counts (read and reference) and compare it with a score threshold T
  – Candidate positions exceeding T are discarded
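The count comparison can be sketched as follows. The function names and the threshold values used in any call are illustrative assumptions; the slide does not give the value of T.

```python
from collections import Counter

def count_difference(read_bases, ref_bases):
    """Sum of absolute differences between A/C/G/T counts of two 12-base strings."""
    read_counts = Counter(read_bases)
    ref_counts = Counter(ref_bases)
    return sum(abs(read_counts[b] - ref_counts[b]) for b in "ACGT")

def passes_filter(read_bases, ref_bases, threshold):
    """Keep a candidate location only if its count difference does not exceed T."""
    return count_difference(read_bases, ref_bases) <= threshold
```

Because only base composition is compared, the filter is cheap and tolerant of small shifts, but it can pass candidates whose bases are merely permuted; those are left to the later alignment step.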

Single End - Mapping Posterior

• Probability that a mapping is incorrect:

  $1 - P(\mathrm{read} \mid L_{opt}) \,/\, \sum_i P(\mathrm{read} \mid L_i)$

• L_opt is the maximum-likelihood mapping location
• The sum runs over all considered locations L_i
• This is only an approximation, as the correct location may not be among the considered L_i, because:
  1. The read contains highly repetitive 15mers
  2. The read is of low quality or highly diverged from the reference
  3. The sequence is not represented in the reference
• The final mapping Phred score sums the contributions of 1, 2 and 3
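As a worked illustration of the posterior above, the sketch below computes the error probability from a list of per-location likelihoods and converts it to a Phred-scaled value. The function names and the cap of 60 are assumptions made for the example, not Stampy's actual settings.

```python
import math

def mapping_error_probability(likelihoods):
    """1 - P(read | L_opt) / sum_i P(read | L_i) over the considered locations."""
    best = max(likelihoods)
    total = sum(likelihoods)
    return 1.0 - best / total

def phred(p_error, cap=60.0):
    """Phred-scale an error probability; `cap` is an illustrative upper bound."""
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))
```

For example, likelihoods of 0.9, 0.05 and 0.05 give an error probability of about 0.1, that is a Phred score of about 10.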

Paired-end reads: paired-end candidates

• Pair is unmapped if no candidates are found for both reads
• Report the pair coordinates if
  – the best locations for the pair are within 4 SD of the mean insert length
  OR
  – Phred score >=2 in (1 & 2)
• Else
  – Candidates that constitute 99.9% of the posterior mapping score of the single read are extracted
    • Its mate is mapped against the reference region implied by the insert-length distribution
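Two of the rules above lend themselves to a short sketch: selecting the candidates that make up 99.9% of the posterior mass, and checking whether a pair lies within 4 SD of the mean insert length. Function and parameter names are hypothetical, and the posteriors are assumed to be given as a plain list.

```python
def top_candidates(posteriors, mass=0.999):
    """Indices of the highest-posterior candidates that together reach `mass`."""
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    total = sum(posteriors)
    kept, accumulated = [], 0.0
    for i in order:
        kept.append(i)
        accumulated += posteriors[i]
        if accumulated >= mass * total:
            break
    return kept

def within_insert_range(insert_len, mean, sd, n_sd=4):
    """True if the observed insert length is within n_sd standard deviations."""
    return abs(insert_len - mean) <= n_sd * sd
```

Truncating the candidate list at 99.9% of the mass keeps the mate search focused on locations that could plausibly be correct.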

Paired-end reads: paired-end candidates

• Final mapping quality
  – Product of the top-scoring single-end hits selected as the pair
  – Or the single-end posterior score of the anchoring hit

• LAST uses probabilistic alignment instead of maximum-score-based alignment
  – Based on a posterior decoding technique that uses marginal probabilities, which incorporate all possible alignments with quality scores

Outline of LAST

• Incorporating quality scores into a score matrix
• Probabilistic model for alignment
• Marginal Probabilities
• Probabilistic alignments with quality scores
  – Y-centroid alignment
  – LAMA alignment

Incorporating quality scores into a score matrix

• Old method: Sa,b is the substitution score for aligning reference nucleotide a with read nucleotide b

• Incorporate quality-score, q, into S

– T is a scaling factor
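The slide's own formula is not preserved, so the following is only one plausible way to fold a quality score into a substitution score, derived under explicit assumptions: the score is a scaled log likelihood ratio, S = T ln R; background base frequencies are uniform; and a sequencing error replaces the true base with any of the other three bases with equal probability. The function name is hypothetical.

```python
import math

def quality_adjusted_score(score, quality, t):
    """Quality-aware substitution score under the assumptions stated above.

    score:   original score S(a, b), assumed to equal T * ln(R)
    quality: Phred quality q of the read base
    t:       scaling factor T
    """
    # Phred quality -> error probability, clamped so the mixture stays positive.
    e = min(10 ** (-quality / 10), 0.75)
    ratio = math.exp(score / t)  # likelihood ratio R implied by the score
    # With probability 1 - e the observed base is the true one; otherwise the
    # true base is one of the other three, whose ratios sum to 4 - R when the
    # background is uniform.
    mixed = (1 - e) * ratio + (e / 3) * (4 - ratio)
    return t * math.log(mixed)
```

Under these assumptions a high-quality base keeps nearly its original score, while a very low-quality base contributes a score close to zero, whether it matches or not.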

Probabilistic model for alignment

• Let S(A) be the score for alignment A.
• For a local alignment A of x and y, the probability of A is computed from S(A)
  – x is the genome region
  – y is the read, whose bases carry quality scores
  – S(A) is computed from the ‘new’ substitution score matrix
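The formula itself is missing from the slide. A common way to turn alignment scores into probabilities, and the form assumed here, is a Boltzmann-style distribution over all local alignments of x and y, with T the same scaling factor as in the score matrix and Z a normalizing constant:

```latex
P(A \mid x, y) = \frac{\exp\left(S(A)/T\right)}{Z},
\qquad
Z = \sum_{A'} \exp\left(S(A')/T\right)
```

Under this assumed form, the maximum-score alignment is also the most probable single alignment, but lower-scoring alternatives still carry probability mass, which is what the marginal probabilities summarize.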

Marginal Probabilities

• Pik is the marginal probability that a base xi (i-th base of x) aligns with a base yk (k-th base of y)

• qi is the marginal probability that a base xi aligns with a gap

• Ui is the marginal probability that xi belongs to an un-aligned region that is not contained in the local alignment
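Read together, these definitions suggest the following relations, written here as a hedged sketch rather than as the paper's exact statement. Pik sums the probabilities of all alignments that contain the column xi~yk, and, if the three cases are taken as mutually exclusive and exhaustive for each xi, the marginals sum to one:

```latex
P_{ik} = \sum_{A \,:\, (x_i \sim y_k) \in A} P(A \mid x, y),
\qquad
\sum_{k} P_{ik} + q_i + U_i = 1
```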

Probabilistic alignments with quality scores

• Two methods consider quality scores by using the marginal probabilities
  1. Y-centroid alignment
  2. LAMA alignment

Y-centroid alignment

• Maximizing S(A) for alignment A
  – Y (γ in the LAST paper) is an input parameter
• xi~yk is an aligned column (without gaps) in A
  – Computed with the NW (Needleman-Wunsch) algorithm
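The slide's objective function is not preserved. The sketch below assumes the usual centroid-estimator form, in which each aligned column xi~yk contributes (Y + 1) * Pik - 1 and gaps contribute nothing, and maximizes it with a Needleman-Wunsch-style dynamic programme over a given matrix of marginal probabilities. The function name and the input layout (`p[i][k]` = Pik) are hypothetical.

```python
def centroid_alignment(p, gamma):
    """Aligned columns (i, k) maximizing the sum of (gamma + 1) * p[i][k] - 1.

    p is a matrix of marginal probabilities; gaps are assumed to score 0.
    """
    n = len(p)
    m = len(p[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            gain = (gamma + 1) * p[i - 1][k - 1] - 1
            best[i][k] = max(best[i - 1][k - 1] + gain,  # align x_i with y_k
                             best[i - 1][k],             # leave x_i unaligned
                             best[i][k - 1])             # leave y_k unaligned
    # Trace back to recover the aligned columns.
    columns = []
    i, k = n, m
    while i > 0 and k > 0:
        gain = (gamma + 1) * p[i - 1][k - 1] - 1
        if gain > 0 and best[i][k] == best[i - 1][k - 1] + gain:
            columns.append((i - 1, k - 1))
            i, k = i - 1, k - 1
        elif best[i][k] == best[i - 1][k]:
            i -= 1
        else:
            k -= 1
    columns.reverse()
    return columns
```

With this objective a column can only help if its probability exceeds 1 / (Y + 1), so a low Y aligns only confident columns and a higher Y admits weaker ones.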

Parameter Y

• Adjusts the sensitivity and precision of the aligned columns
  – When Y is low, LAST is conservative and only aligns bases with high probabilities
  – When Y is high, more columns are aligned, at the cost of more false positives
• Drawback of Y-centroid
  – Even with a low Y, LAST may still contain many …

LAMA alignment

• Considers the aligned columns and the gaps explicitly
• For the gaps: [Figure: deletion in alignment; insertion in alignment]
