Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Bioinformatics - Lecture 04
BioinformaticsSequence alignment methods
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.5
Creative Commons Attribution-Share Alike 2.5 License
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Alignment
Approximate pattern matching - common sequence taskssequence comparison, domains, motif searchsimilarity and homology between sequences
Alignment approachesbasic methods- dot matrices, scoring matrices- dynamic programmingheuristic algorithms- word methods- fasta types, blast typesmultiple alignment- HMM approaches- MCMC based approaches
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Proteins
sequences of aminoacid acyls: N-end to C-end ordering
flexible strands, enzymatic, signalling and structural functions
20 symbols alphabet
collagen helix triple: Gly G, Pro P, HYPaliphatic: Ala A, Val V, Leu L, Ile I;aromatic: Phe F, Tyr Y, Trp W;polar: Cys C, Ser S, Thr T - His H;Met M, Pro P; Gln Q, Asn N; Tyr Ycharged: Asp D, Glu E; Lys K, Arg R;non-standard: selenocysteine Sec, pyrrolysine
peptide backbone with peptide bondstwenty different standard residua
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Dot matrix plot
simple search for diagonals of matched pairs
sequence vs. sequence matrix plotdots on positions with matched pairsrecurrences, identical regions
diagonal smoothingmeans of match/mismatch positionsmismatches and gaps just qualitatively
how to count it more accurately:similarity - scoring functionsgaps - start, prolongation
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Similarity types
similarity definition: what is it?
proteinshydrophobic vs. hydrophilicaromatic cyclecharge and polaritysize and flexibilityfunctional residuasecondary structure proneness
nucleic acidspurine vs. pyrimidineAT vs. GCcoding neutralityRNA pairing preservation
similarityhomology (ortho/para-logy) vs. homoplasy (convergence)
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Similarity measures
log odds matrices
similarity scores based on percentage of changes
log odds:Mi,j = log qi,j/(pi · pj)
logarithm of the ratio of observed vs. expected frequences
matrix constructionalign two sequences of the same lengthcount all the substitution mutations
unknown mutation direction → count it in the both ways
compute the log oddslinear transform and round the values
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Log odd example0101101001010100101001101011001001101010
Pr(0) = 21/40 = 0.525Pr(1) = 19/40 = 0.475
expected alignment (mutation) frequencies:p0p0 = (21/40)2 = 0.275625p0p1 = (21/40) · (19/40) = 0.249375p1p0 = (19/40) · (21/40) = 0.249375p1p1 = (19/40)2 = 0.225625
observed alignment (mutation) frequencies:0 → 0: q00 = 14/40 = 0.350 → 1: q01 = 7/40 = 0.1751 → 0: q10 = 7/40 = 0.1751 → 1: q11 = 12/40 = 0.3
the logarithm and the scoring:ln[(14 ∗ 40)/(21 ∗ 21)] = 0.23889191 → 2ln[(7 ∗ 40)/(21 ∗ 19)] = −0.35417181 → −4ln[(7 ∗ 40)/(19 ∗ 21)] = −0.35417181 → −4ln[(12 ∗ 40)/(19 ∗ 19)] = 0.28490815 → 3
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Scoring matrices
standard substitution scoring matricesPAM for closely related speciesBLOSUM for distant sequences
PAM - Point Accepted MutationsPAM1 for 1 point substitution per 100 aminoacidsPAMn computed as a stochastic processes
assumption of Markov processamounts of subsequent changesM2 = M12, Mn = M1n
greater n → for greater evolutionary distances
BLOSUM - BLOck SUbstitution MatrixBLOSUMn
on datasets of sequences with at most n-% identityBLOSUM100 from the total data setsBLOSUM62 usually used
lesser n → for greater evolutionary distances
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Dynamic programming approach
pairwise similarity alignment
sequence vs. sequence matrixsimilarity, gaps computationanalogy to the longest path search
global searchto align sequence-to-sequence at whole lengthsNeedleman-Wunsch algorithm
various modifications of the algorithmlocal search
to find maximally similar subsequence alignmentsSmith-Waterman algorithm
itself a variation of the Needleman-Wunsch algorithm
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Global search
exampleS1: E S C H - E R
| | . |S2: - S C E N E -
alignment of two sequencessimilarity scoring matrixconstant gap penalty measure
algorithmmake a free matrix of size (|S1|+ 1)× (|S2|+ 1)
rows, columns idexed by the two given sequencesgap values in the uppermost row and the leftmost column
zero-valued for tail ignoring
start in the upper left cornerpass rightward and downward
maximally valued alignments by taking maxima ofsums of current alignments and passes to new positions
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Global search example
alignment of ESCHER and SCENE sequences
E S C H E R0 g 2g 3g 4g 5g 6g
S g a1,1
C 2g ↘m ↓g1
E 3g →g2
ai,j
N 4gE 5g
sequences: S1, S2similarity matrix: si,jgap penalty: galignment scores: ai,j
start:a1,1 = max(sS1,S2 , g, g)
iterate:am
i,j = ai−1,j−1 + sSi ,Sj
ag1i,j = ai−1,j + g, ag2
i,j = ai,j−1 + gai,j = max(am
i,j , ag1i,j , ag2
i,j )
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Global search result
E S C H E R0 -2 -4 -6 -8 -10 -12
S -2 0(m) 2(m) 0(g2) -2(g2) -4(g2) -6(g2)C -4 -2(g1) 0(g1) 11(m) 9(g2) 7(g2) 5(g2)E -6 1(m) -1(g2) 9(g1) 13(m) 14(m) 12(g2)N -8 -1(g1) 2(m) 7(g1) 11(g1) 13(m) 14(m)E -10 -3(m) 0(g1) 5(g1) 9(m) 16(m) 14(g2)
m, g1, g2 are for a match, gapsg = −2 for the gap penalty used
part of the BLOSUM62 matrix:
backward search ESCH-ERfor the alignment -SCENE-
C E H N R SC 9 -4 -3 -3 -3 -1E -4 5 2 0 0 0H -3 2 8 1 0 -1N -3 0 1 6 0 1R -3 0 0 0 5 -1S -1 0 -1 1 -1 4
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Modifications
some of the global search variations
linear memory usage with time doublingHirschberg’s algorithm
small alignments hashingjust log time speed-up
Hirschberg’s algorithmquadratic memory needed for the backward search onlymake two alignments - for the first, second half of S1 and S2we find the middle point on S1 of the whole alignmentiterate recursively on subparts of the S1 string
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Gaps
deletion / insertion regions
a case of many small gaps is evolutionary less probablethan a case of one (a few of) large gapshow to incorporate it into alignments?
affine gap penaltiesgap start - more badgap continuation - less bad
g = c1 + c2 · gap lengthnecessary to keep subalignment values forthe cases of gap passes just less good than match passes
general gap penaltiesharder to compute with such scenarios
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Local alignment
search for similar subsequences
local alignment specificsit does not care about unrelated partsit has to outline the similar regionsnecessary (effective) gap penalties
local alignment algorithmstart as with the global alignment but iterate withmaking all the subalignment values non-negative!find the greatest subalignment value, can beanywhere inside the alignment iteration matrixtrace its path backward until zero value is approachedcan search for all sufficiently high valued subalignments
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Heuristic methods
how to slove daily requests
sequence alignments the current top-most tasks
exact methods find alignment optima, however undertoo high memory and time processing requirements
Needleman-Wunsch, Smith-Waterman algorithms
heuristic methods necessary for the real world demandsgeneral heuristic alignment outline
start with limited local matchesenlarge matches while the alignment score is large enough
two main approaches: fasta, blast typesfor both nucleotide and amino acid sequences
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Fasta
fasta algorithm principle
find identity matches by the dot-plot matrixstandard sizes of the k-tuples searched
length of 6 for nucleotides, length of 2 for amino acidslook-up table or hashing for substring storage of one string
enlarge hot-spot parts of diagonalsusage of substitution scoring matricesstill no deletions, insertions allowed
combine nearby subsequences into local alignmentsmake a weighted directed graphweighted vertices are single diagonal alignmentsedges between possibly adjoint subalignmentsfind the most weighted path on the graph
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Blast program
the current way to do pairwise sequence alignment
effective search on genome-large databasesapproximation of the Smith-Waterman algorithmsufficiently fast for the world demands
faster than fastamuch faster than the optimum guaranteed search
less sensitive approach, still largelly sufficient
word-based search methodgenome databases, preprocessed in advancequery (target) sequence that is searched in the databases
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Blast method
blast algorithm principle
words inside genome databases indexed in advancethree steps of the search process itself
word search, word list expanding, match enlarging
look for exact matcheswords from the query string against the word databasestandard word lengths: 3 for amino acids, 11 for nucleotides
enlarge low sensitivityinitial sensitivtiy low due to the search of exact matches onlytake all the high similarity words of the indexed databasesimilarity by a chosen substitution scoring matrix
find pairs of nearby high-score matchesrequires to have at least two closely located matchesenlarge regions of alignment pairs while high scoresthe enlarging stage takes most of the time
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Statistics on alignment scores
ideal world case
look for outliers in search resultsoutliers according to z, standrad deviations
s = [1/(n − 1) ·∑n
i (xi − µ)]1/2
zi = (xi − µ)/snormal distribution nice but not the case for usbinomial and Poisson distribution more accurate ones
both of them are skewed to the right!we can find high z-score cases more likely
binomial distribution - two parameters: n and pn - number of times of processing an experimentp - probability of a success in an experiment
Poisson distributiongood approximation of the binomial distributionfor large n and small p valuesPr(X = x) = (1/x!) · e−np · (np)x
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Statistics on database search
real world case
real search pitfallsdatabases inhomogenous with respect to sequence typesneed to adjust parameters for general search queriessearch results contain probable amounts of sequencesof the result similarity found by chance
extreme value distribution - for maximum segment pairsprobability estimation
P(s > S) > 1 − exp(−Kmne−vS)m, n - lengths of the query string, of the databaseK , v - parameters that are to be adjusteds - score variable, S - score cut off
the expected amount: expect .= Kmne−vS
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Variations on the BLAST
a lot of software based on the standard blast program
PSI-BLAST: position specific iterative blastfor evolutionary distant relatives of a given proteinusage of position specific scoring matrices (PSSM)
size of PSSM is 20× profile lengthiterative work of the PSI-BLAST
closely related proteins are found firsta profile of the found proteins is madePSSM created for the profile on the found proteinsnew search done with the new PSSM
PHI-BLAST: pattern hit initiated blastfor pattern search in protein databasessearch for proteins which contain the pattern and aresimilar to the query sequence nearby the pattern
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Multiple alignment
MSA - multiple sequence alignment
pair-wise alignment based methodsfrom the most similar pairs
probability based profiling methodsprofiles based on stochastic assumptionsHMM methods, MCMC approaches
usage for:consensus sequence determinationmotifs, domains characterizationphylogenetic analysis, cladisticsstructure and function similarity
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Pair-wise MSA
heuristic multiple alignment construction
progressive multiple alignmentclustering approach, phylogenetic tree construction
create a phylogenetic tree by a clusteringbottom-up clustering, e.g. neighbor joiningfind the most related sequences and align themiterate the alignment along the treei.e. make alignment for the other sequences
iterative multiple alignmentstart like the progressive multiple alignmentadds current alignment modificationsadjustig based on hill-climbing methods
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
HMM approach
assumption of no or weak distant interaction
we do not look far away along the sequences
good for proteins, not for nucleic acidscounting all possible matches and gapsworks nice with a given HMM profilereasonable results on HMM profiling
states of the HMMsmatches, insertions, deletionsmatching a profile position to a sequence positioninserting to / deleting from a protein sequncewith respect to the given HMM profile
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
HMM example
I 0
M1
D1
I I I I I I1 2 3 4 5 6
MM2 M3 M4 5 M6
D D D DD2 3 4 5 6
EB
SCHNEE: B - M1,S - M2,C - M3,H - M4,N - M5,E - M6,E - EESCHER: B - I0,E - M1,S - M2,C - M3,H - D4,− - M5,E - M6,R - ESCENE: B - M1,S - M2,C - M3,E - M4,N - M5,E - D6,− - E
B, E for begin, end states of the HMM alignmentMi , Ii , Di for match, insertion, deletion states
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
HMM and domains
problems with gaps positioning between domains
the model topology is for affine gap penaltiesgaps can be rather largeGibbs sampling as a solution
model surgeryto remove under-used match statesto add match states in place of over-used insert states
FIM - free inserion modulesadded to both (start and stop) endswithout specificity on symbol outputsprobability ratio of inner insertions vs. FIM entry
decrease / icrease of amount of insertions inside domains
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
MCMC
Markov chain Monte Carlo
kind of sampling from a probability distributionthe Markov chain is constructed in such a way thatits staionary distribution is the required distributionseveral classes of methods for such constructionsiteration on the Markov chain has to lead to the requiredprobability distribution, usually a slow process
Gibbs sampling - one of the approacheslocal alignments for a domain searchfor distributions of l-mer alignments
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Gibbs sampling
genearal method outlinemaking samples of one variable out of many variablessubsequent samplings form a Markov chain
suitable values given by its stationary distribution
herein, the sampling is of domain locationspositions of motifs inside given sequences asthe stationary distributions
with high, single peaks demanded
locations of putative l-mer domainssubsequent sampling of individual sequence locationsprocess finished when a local aligment is optimalmany trials and many start configurations necessary
dealing with problems of local optima
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
GS alignment
start - choose random positions of l-mers inside sequencesiterate the sampling:
take one of the sequences for the sampling processmake a (local) alignment of the rest l-merscreate profile out of the current alignmenttry all the positions of l-mer on the taken sequencecompute probabilities of such l-mers given current profilechoose one position according to the computedprobabilities
stop - when total alignment scores do not increase
Gibbs sampling obstacleswe have to take care about local optimamake projections on conserved residua
random subsequences of the l-mers for profile constructionthe length of the putative domains is usually unknown
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Distant domains
search for the least amount of sort reversals
domain orderpermutations of particular exons
gene locationsrearrangements of chromosomes
reversal algorithm, iterativeif there exists a decreasing strip
find the decreasing strip with the lowest number i insidefind the increasing strip with i − 1 number insiderevert the interim sequence, break amount decreases
otherwise revert an incresing stripno more break originates, one decreasing strip more
finally no breaks, i.e. the identity permutation
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment
Items to remember
Nota bene:
similarity and homology types
Dynamic programming approachScoring matrices, gapsGlobal and local alignmentsHeuristics and statistics
Multiple sequence alignmentHMMs: match, insertion, deletion statesMCMC approach, Gibbs sampling methods
Martin Saturka www.Bioplexity.org Bioinformatics - Alignment