
Sequence Alignment

Marco Botta
Dipartimento di Informatica
Università di Torino
[email protected]

www.di.unito.it/~botta/didattica/

Sequence Comparison

Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences

We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters


Sequence Comparison (cont)

• Finding similarity between sequences is important for many biological questions

For example:
• Find genes/proteins with a common origin
  – Allows us to predict function & structure
• Locate common subsequences in genes/proteins
  – Identify common “motifs”
• Locate sequences that might overlap
  – Helps in sequence assembly


Sequence Alignment

Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
• GCGCATGGATTGAGCGA
• TGCGCCATTGATGACCA

A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A


Alignments

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indel)


Choosing Alignments

There are many possible alignments. For example, compare:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Which one is better?


Scoring Alignments

Rough intuition:
• Similar sequences evolved from a common ancestor
• Evolution changed the sequences from this ancestral sequence by mutations:
  – Replacements: one letter replaced by another
  – Deletion: deletion of a letter
  – Insertion: insertion of a letter
• Scoring of sequence similarity should examine how many operations took place


Simple Scoring Rule

Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2

The score of an alignment is the sum of positional scores


Example

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Score: (+1×13) + (-1×2) + (-2×4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Score: (+1×5) + (-1×6) + (-2×12) = -25
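The simple scoring rule can be checked mechanically. The following is a minimal sketch, assuming alignments are given as two equal-length strings with "-" for an indel; the function name `score_alignment` is illustrative and not part of the slides.

```python
def score_alignment(a, b, match=1, mismatch=-1, indel=-2):
    """Sum the positional scores of two equal-length gapped strings."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += indel      # insertion or deletion
        elif x == y:
            total += match      # perfect match
        else:
            total += mismatch   # replacement
    return total


# The two alignments compared in the slides.
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))          # 3
print(score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--"))  # -25
```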


More General Scores

• The choice of +1, -1, and -2 scores was quite arbitrary
• Depending on the context, some changes are more plausible than others
  – Exchange of an amino acid by one with similar properties (size, charge, etc.)
  vs.
  – Exchange of an amino acid by one with opposite properties


Additive Scoring Rules

• We define a scoring function by specifying a function
$$\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$$
  – σ(x,y) is the score of replacing x by y
  – σ(x,-) is the score of deleting x
  – σ(-,x) is the score of inserting x
• The score of an alignment is the sum of position scores


Edit Distance

• The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other
• Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance
$$d(s_1, s_2) = \max_{\text{alignment of } s_1 \,\&\, s_2} \text{score(alignment)}$$


Computing Edit Distance

• How can we compute the edit distance?
  – If |s| = n and |t| = m, there are more than $\binom{m+n}{m}$ alignments
• The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently


Recursive Argument

• Suppose we have two sequences: s[1..n+1] and t[1..m+1]

The best alignment must be in one of three cases:
1. Last position is (s[n+1], t[m+1])
2. Last position is (s[n+1], -)
3. Last position is (-, t[m+1])

Case 1:
$$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$$


Recursive Argument



Case 2, last position is (s[n+1], -):
$$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$$


Recursive Argument



Case 3, last position is (-, t[m+1]):
$$d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$$


Recursive Argument

Define the notation:
$$V[i,j] = d(s[1..i], t[1..j])$$

• Using the recursive argument, we get the following recurrence for V:
$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$


Recursive Argument

• Of course, we also need to handle the base cases in the recursion:
$$V[0,0] = 0$$
$$V[i+1,0] = V[i,0] + \sigma(s[i+1], -)$$
$$V[0,j+1] = V[0,j] + \sigma(-, t[j+1])$$


Dynamic Programming Algorithm

We fill the matrix using the recurrence rule, here for s = AAAC (rows) and t = AGC (columns):

            A    G    C
       0    1    2    3
  0    0   -2   -4   -6
A 1   -2    1   -1   -3
A 2   -4   -1    0   -2
A 3   -6   -3   -2   -1
C 4   -8   -5   -4   -1

Conclusion: d(AAAC,AGC) = -1
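The matrix above can be reproduced with a few lines of code. This is a minimal sketch of the recurrence, assuming the simple scoring rule (match +1, mismatch -1, indel -2) in place of a general σ; the function name `global_alignment_matrix` is illustrative.

```python
def global_alignment_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Fill V[i][j] = best score of aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against the empty string.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    # Recurrence: the last column is a (mis)match, a deletion or an insertion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    return V


V = global_alignment_matrix("AAAC", "AGC")
for row in V:
    print(row)
print("d(AAAC, AGC) =", V[4][3])
```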


Reconstructing the Best Alignment

• To reconstruct the best alignment, we record which case in the recursive rule maximized the score



• We now trace back the path that corresponds to the best alignment

AAAC
AG-C



• Sometimes, more than one alignment has the best score

AAAC
A-GC
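The traceback can be sketched as follows, again assuming the simple scoring rule. Instead of storing pointers, this version re-checks which case of the recurrence produced each cell; when several cases tie it prefers the diagonal, so it returns just one of the co-optimal alignments, which need not be one of those shown above. The function name `global_align` is illustrative.

```python
def global_align(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * indel
    for j in range(1, m + 1):
        V[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    # Trace back from the bottom-right corner to the origin.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                a.append(s[i - 1]); b.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            a.append(s[i - 1]); b.append("-")   # s[i] opposite a space
            i -= 1
        else:
            a.append("-"); b.append(t[j - 1])   # t[j] opposite a space
            j -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = global_align("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```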


Complexity

Space: O(mn)
Time: O(mn)
• Filling the matrix: O(mn)
• Backtrace: O(m+n)


Local Alignment

Consider now a different question:
• Can we find similar substrings of s and t?
• Formally, given s[1..n] and t[1..m], find i, j, k, and l such that d(s[i..j], t[k..l]) is maximal


Local Alignment

• As before, we use dynamic programming
• We now want to set V[i,j] to record the best alignment of a suffix of s[1..i] and a suffix of t[1..j]
• How should we change the recurrence rule?


Local Alignment

New option:
• We can start a new match instead of extending a previous alignment
$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \\ 0 \end{cases}$$
The option 0 corresponds to the alignment of empty suffixes.


Local Alignment

• Again, we also need to handle the base cases in the recursion:
$$V[0,0] = 0$$
$$V[i+1,0] = \max(0, V[i,0] + \sigma(s[i+1], -))$$
$$V[0,j+1] = \max(0, V[0,j] + \sigma(-, t[j+1]))$$


Local Alignment Example

s = TAATA
t = TACTAA

With the simple scoring rule (match +1, mismatch -1, indel -2), the filled matrix is:

            T    A    C    T    A    A
       0    1    2    3    4    5    6
  0    0    0    0    0    0    0    0
T 1    0    1    0    0    1    0    0
A 2    0    0    2    0    0    2    1
A 3    0    0    1    1    0    1    3
T 4    0    1    0    0    2    0    1
A 5    0    0    2    0    0    3    1

The highest entry, 3, marks the end of a best local alignment.
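The local variant differs from the global one only in the extra option 0 and in where the answer is read: the best score is the maximum over the whole matrix. A minimal sketch, assuming the simple scoring rule (match +1, mismatch -1, indel -2); the function name `local_alignment_matrix` is illustrative.

```python
def local_alignment_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Return (V, best), where V[i][j] is the best score of aligning a
    suffix of s[:i] with a suffix of t[:j], and best is the maximum entry."""
    n, m = len(s), len(t)
    # With a negative indel score the first row and column stay at 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                      # start a new match here
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
            best = max(best, V[i][j])
    return V, best


V, best = local_alignment_matrix("TAATA", "TACTAA")
for row in V:
    print(row)
print("best local score:", best)
```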


Sequence Alignment

We have seen two variants of sequence alignment:
• Global alignment
• Local alignment

Other variants:
• Finding best overlap (exercise)

All are based on the same basic idea of dynamic programming


Alignment with Gaps

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA


Gaps

• Both alignments have the same number of matches and spaces, but alignment II seems better
• Definition: a gap is any maximal, consecutive run of spaces in a single string
• The length of a gap is the number of spaces in it
• Example I has 11 gaps while example II has only 2 gaps
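The definition of a gap translates directly into a regular expression over each row. A minimal sketch, assuming "-" denotes a space; the helper names are illustrative.

```python
import re


def count_gaps(row):
    """Number of maximal runs of '-' in one aligned string."""
    return len(re.findall(r"-+", row))


def count_spaces(row):
    """Total number of '-' characters in one aligned string."""
    return row.count("-")


alignment_I = ("AAC-AATTAAG-ACTAC-GTTCATGAC",
               "A-CGA-TTA-GCAC-ACTG-T-C-GA-")
alignment_II = ("AACAATTAAGACTACGTTCATGAC---",
                "AACAATT--------GTTCATGACGCA")

for name, rows in (("I", alignment_I), ("II", alignment_II)):
    gaps = sum(count_gaps(r) for r in rows)
    spaces = sum(count_spaces(r) for r in rows)
    print(name, "gaps:", gaps, "spaces:", spaces)
```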


Biological Motivation

• Number of mutational events
  – A single gap: due to a single event that removed a number of residues
  – Each separate gap: due to distinct independent events
• Protein structure
  – Protein secondary structure consists of alpha helices, beta sheets and loops
  – Loops of varying size can lead to very similar structure


Biological Motivation


cDNA matching

• cDNA is the sequence after splicing and editing, after the introns have been removed
• We expect regions of high similarity separated by long gaps
• These gaps correspond to the introns removed by splicing


Gap Penalty Models

• Constant model
  – Gives each gap a constant weight; spaces are free
  – Maximize: $\sum_i S(s'_i, t'_i) + W_g \times \#\text{gaps}$
  – Time: $O(nm)$
  – Works well for cDNA matching
• Affine model
  – There is a penalty for starting a gap and a penalty for each space extending it
  – A single gap of length q contributes $W_g + q W_s$
  – Maximize: $\sum_i S(s'_i, t'_i) + W_g \times \#\text{gaps} + W_s \times \#\text{spaces}$
  – Time: $O(nm)$
  – Most widely used


Gap Penalty Models

• Convex model
  – Each extra space contributes less penalty
  – Gap function is convex in length
  – Example: $W_g + \log q$
  – Time: $O(mn \log m)$
  – Better model of biology
• General model
  – The weight of a gap is some arbitrary function $w(q)$
  – Time: $O(nm^2 + n^2 m)$


Example Revised

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA


Indel model

Scoring parameters: Match: +1, Indel: -2

Score of alignment I: -6
Score of alignment II: -6


Constant model

Scoring parameters: Match: +1, Open gap: -2

Score of alignment I: -6
Score of alignment II: 12


Affine model

Scoring parameters: Match: +1, Open gap: -2, Extend gap: -1

Score of alignment I: -17
Score of alignment II: 1


Convex model

Scoring parameters: Match: +1, Open gap: -2, Gap of length n: -log n

Score of alignment I: -6
Score of alignment II: ~7
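The four models can be compared on alignment II with a short script. This is a sketch under stated assumptions: only matches earn a score, the penalties are the parameter values from the slides above, and the logarithm in the convex model is taken base 2 (an assumption chosen because it reproduces the approximate value 7). The names `gap_lengths`, `count_matches` and `model_score` are illustrative.

```python
import math
import re

ALIGNMENT_II = ("AACAATTAAGACTACGTTCATGAC---",
                "AACAATT--------GTTCATGACGCA")


def gap_lengths(a, b):
    """Lengths of all maximal runs of '-' in either row."""
    return [len(run) for row in (a, b) for run in re.findall(r"-+", row)]


def count_matches(a, b):
    """Columns where both rows carry the same letter."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def model_score(a, b, model):
    gaps = gap_lengths(a, b)
    if model == "indel":        # every space costs 2
        penalty = 2 * sum(gaps)
    elif model == "constant":   # every gap costs 2, spaces are free
        penalty = 2 * len(gaps)
    elif model == "affine":     # 2 to open a gap, 1 per space
        penalty = 2 * len(gaps) + 1 * sum(gaps)
    elif model == "convex":     # 2 to open a gap, log2(length) per gap
        penalty = sum(2 + math.log2(q) for q in gaps)
    else:
        raise ValueError(f"unknown model: {model}")
    return count_matches(a, b) - penalty


for model in ("indel", "constant", "affine", "convex"):
    print(model, model_score(*ALIGNMENT_II, model))
```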


Affine Weight Model

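• We divide the possible alignments of the prefixes S1..i and T1..j into 3 types:

A(i,j): S[i] is aligned with T[j]
S_______i
T_______j

B(i,j): the alignment ends with spaces in S
S______i------
T____________j

C(i,j): the alignment ends with spaces in T
S____________i
T_______j-----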


Affine Weight Model

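Recurrence relations

$$A(i,j) = \max \begin{cases} A(i-1,j-1) + s(i,j) \\ B(i-1,j-1) + s(i,j) \\ C(i-1,j-1) + s(i,j) \end{cases}$$

$$B(i,j) = \max \begin{cases} A(i,j-1) + W_g + W_s \\ B(i,j-1) + W_s \end{cases}$$

$$C(i,j) = \max \begin{cases} A(i-1,j) + W_g + W_s \\ C(i-1,j) + W_s \end{cases}$$

Here s(i,j) is the score of aligning S[i] with T[j]. B extends an alignment of S1..i and T1..j-1 by placing T[j] opposite a space; C extends an alignment of S1..i-1 and T1..j by placing S[i] opposite a space.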


Affine Weight Model

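Initial Conditions:

$$A(0,0) = 0$$
$$B(0,j) = W_g + j W_s \quad (j \ge 1)$$
$$C(i,0) = W_g + i W_s \quad (i \ge 1)$$

All other entries in row 0 and column 0 are $-\infty$, since no alignment of that type exists.

Complexity
• Time: O(nm); we compute 3 matrices
• Space: O(nm)

Optimal alignment:
$$V(n,m) = \max\{A(n,m), B(n,m), C(n,m)\}$$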
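A minimal sketch of the three-matrix computation, assuming a match/mismatch score in place of s(i,j), negative weights W_g and W_s, and boundary entries of minus infinity wherever no alignment of the given type exists. Like the two-term recurrences for B and C, it does not allow a gap in one string to be followed immediately by a gap in the other. The function name `affine_global_score` and the default parameter values are illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, wg=-2, ws=-1):
    """Best global alignment score under the affine gap model."""
    n, m = len(s), len(t)
    A = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with s[i] vs t[j]
    B = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with t[j] vs space
    C = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with s[i] vs space
    A[0][0] = 0
    for j in range(1, m + 1):
        B[0][j] = wg + j * ws
    for i in range(1, n + 1):
        C[i][0] = wg + i * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1], B[i - 1][j - 1], C[i - 1][j - 1]) + sub
            B[i][j] = max(A[i][j - 1] + wg + ws,   # open a new gap
                          B[i][j - 1] + ws)        # extend the current gap
            C[i][j] = max(A[i - 1][j] + wg + ws,
                          C[i - 1][j] + ws)
    return max(A[n][m], B[n][m], C[n][m])


print(affine_global_score("ACGTTTTACG", "ACGACG"))  # one gap of length 4
print(affine_global_score("AAAC", "AGC"))
```

With these parameters the first call scores six matches and a single gap of four spaces, 6 + (-2) + 4·(-1) = 0, which is why one long gap is preferred over several short ones.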


Affine Weight Model

This model has a natural explanation as a finite state automaton with three states A, B and C:
• Any state → A: S(i,j)
• A → B and A → C: Wg+Ws
• B → B and C → C: Ws


Alignment in Real Life

• One of the major uses of alignments is to find sequences in a “database”
• Such collections contain a massive number of sequences (on the order of $10^6$)
• Finding homologies in these databases with dynamic programming can take too long


Heuristic Search

• Instead, most searches rely on heuristic procedures
• These are not guaranteed to find the best match
• Sometimes, they will completely miss a high-scoring match

We now describe the main ideas used by some of these procedures
  – Actual implementations often contain additional tricks and hacks


Basic Intuition

• Almost all heuristic search procedures are based on the observation that real-life matches often contain long strings with gap-less matches
• These heuristics try to find significant gap-less matches and then extend them


Banded DP

• Suppose that we have two strings s[1..n] and t[1..m] such that n≈m
• If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal


Banded DP

• To find such a path, it suffices to search in a diagonal region of the matrix
• If the diagonal band has width k, then the dynamic programming step takes O(kn)
• Much faster than the $O(n^2)$ of standard DP
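A banded version of the global DP can be sketched by storing, for each row, only the cells with |i - j| ≤ k. This is an illustration under assumptions: the simple scoring rule (match +1, mismatch -1, indel -2), a band given as a half-width k around the main diagonal, and the hypothetical function name `banded_global_score`. The result equals the full DP score only when an optimal path stays inside the band.

```python
def banded_global_score(s, t, k, match=1, mismatch=-1, indel=-2):
    """Global alignment score restricted to cells with |i - j| <= k."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Row 0, restricted to the band.
    prev = {j: j * indel for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i * indel
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Cells outside the band count as unreachable.
            cur[j] = max(prev.get(j - 1, neg_inf) + sub,
                         prev.get(j, neg_inf) + indel,
                         cur.get(j - 1, neg_inf) + indel)
        prev = cur
    return prev[m]


print(banded_global_score("AAAC", "AGC", k=1))
print(banded_global_score("ACGT", "ACGT", k=0))
```

Each row touches at most 2k + 1 cells, which gives the O(kn) running time.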


Banded DP

Problem:
• If we know that t[i..j] matches the query s, then we can use banded DP to evaluate the quality of the match
• However, we do not know this a priori!

How do we select which sequences to align using banded DP?


FASTA Overview

Main idea:
• Find potential diagonals & evaluate them
• Suppose that we have a relatively long gap-less match

AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA

• Can we find “clues” that will let us find it quickly?


Signature of a Match

Assumption: good matches contain several “patches” of perfect matches

AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA

Since this is a gap-less alignment, all perfect match regions should be on one diagonal


FASTA

• Given s and t, and a parameter k
• Find all pairs (i,j) such that s[i..i+k]=t[j..j+k]
• Locate sets of pairs that are on the same diagonal
  – By sorting according to i-j
• Compute the score for the diagonal that contains all of these pairs
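The hit-finding step can be sketched with a hash table of words. This illustration makes some assumptions: words have length exactly k (so a hit means s[i:i+k] == t[j:j+k] with 0-based Python indices), hits are grouped by the diagonal i - j with a dictionary instead of sorting, and the function name `diagonal_hits` is hypothetical. It does not reproduce the scoring used by the actual FASTA program.

```python
from collections import defaultdict


def diagonal_hits(s, t, k):
    """Map each diagonal i - j to the list of exact k-word hits (i, j)."""
    # Index every length-k word of t by its start position.
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    # Look up every length-k word of s and group hits by diagonal.
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], []):
            diagonals[i - j].append((i, j))
    return diagonals


hits = diagonal_hits("ACGTACGT", "TTACGT", k=3)
for diag in sorted(hits, key=lambda d: -len(hits[d])):
    print("diagonal", diag, "hits", hits[diag])
```

Diagonals that collect many hits are the candidates for the postprocessing steps.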


FASTA

Postprocessing steps:
  – Find highest scoring diagonal matches
  – Combine these into potential gapped matches
  – Run banded DP on the region containing these combinations
• Most applications of FASTA use very small k (2 for proteins, and 4-6 for DNA)


FASTA Output

SCORES Init1: 1201 Initn: 1844 Opt: 1915

Smith-Waterman score: 1915; 59.3% identity in 496 aa overlap

10 20 30 40

A41264 MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEKIIQAFYNRTL

::::|::|: || |::|||||||| ||||||:|:|: ||:|

A49158 MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQKVIEQSYNETW

10 20 30 40 50 60

50 60 70 80 90 100

A41264 SQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFVNRFGRRNSMLLVNVLAF

|:| :| | ||:||:||||||||||||:|| :::: : :||: :||: ||||

A49158 LGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIGIISQWLGRKRAMLVNNVLAV

70 80 90 100 110 120

. . . . . .



BLAST Overview

• BLAST uses similar intuition
• It relies on high scoring matches rather than exact matches
• It is designed to find alignments of a target string s against large databases


High-Scoring Pair

• Given parameters: length k and threshold T
• Two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T
• Given a query s[1..n], BLAST constructs all words w such that w is an HSP with a k-substring of s
  – Note that not all substrings of s are HSPs!
• These words serve as seeds for finding longer matches


Finding Potential Matches

We can locate seed words in a large database in a single pass
• Construct an FSA that recognizes seed words
• Use hashing techniques to locate matching words


Extending Potential Matches

• Once a seed is found, BLAST attempts to find a local alignment that extends the seed
• Seeds on the same diagonal are combined (as in FASTA)


BLAST programs

• BLASTN - Nucleotide query searching a nucleotide database
• BLASTP - Protein query searching a protein database
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database


BLAST Search



BLAST Output

• List of hits
  – Database accession codes, name, description
  – Score in bits (usually >30 bits is significant)
  – Expectation value E()
• For each hit
  – A header including hit name, description, length
  – Each hit may contain several HSPs
  – Score and expectation value
  – How many identical residues
  – How many residues contributing positively to the score
• The local alignment itself




What do we use

• Originally BLAST did not allow gaps
  – Now people use gapped BLAST
  – Gapped BLAST joins different diagonals
• For proteins BLAST is superior
• For nucleotides FASTA is better


Scoring Matrices

• The alignment depends on the score assigned to pairs of amino acids (scoring matrices) and on the penalty function for indels


Scoring Matrices

• Amino acid substitution matrices, i.e. tables for comparing symbols
• There are tables for comparing proteins and for comparing nucleic acids
• A great many matrices have been published, based on different models


Matrices of M. Dayhoff (1978)

• Percent Accepted Mutation (PAM)
• This family of matrices is based on the probability that, during evolution, one amino acid is substituted by another in homologous protein sequences
• Each matrix measures the changes expected over a given evolutionary period


PAM Matrices

• According to this model, the amino acid substitutions observed over a certain period of time can be extrapolated to longer periods
• In computing the PAM matrices, it is assumed that a change at a given site is independent of previous mutational events at the same site


Computing PAM Matrices

• Based on 1572 mutations in 71 groups of sequences that are at least 85% similar
• The mutations do not significantly alter the function of the proteins (accepted mutations)
• The similar sequences are organized into phylogenetic trees, from which the mutations are inferred


Species A: A W T V A S A V R T S I
Species B: A Y T V A A A V R T S I
Species C: A W T V A A A V L T S I

[Figure: phylogenetic tree relating A, B and C, with the mutations L→R and W→Y marked on its branches]

• $p_i = a_i / a_{tot}$: frequency of amino acid i
• $f_{ij} = n(a_i \to a_j)$: number of mutations $a_i \to a_j$
• $f_i = \sum_j f_{ij}$: number of mutations of $a_i$
• $f = \sum_i f_i$: total number of mutations



• Define the scale as the evolutionary time needed to incorporate 1 mutated amino acid per 100: 1 PAM
• The relative mutability of $a_i$ is:
$$m_i = \frac{f_i}{100 \cdot f \cdot p_i}$$



• If $m_i$ is the probability that $a_i$ mutates, then $M_{ii} = 1 - m_i$ is the probability that $a_i$ is conserved
• The probability of the mutation $a_i \to a_j$ is
$$M_{ij} = \frac{f_{ij}}{f_i} \, m_i$$



• The resulting matrix $M_{ij}$ is a transition matrix
• In general, the probabilities for k evolutionary intervals are given by the entries of the k-th power of the matrix, $(M^k)_{ij}$
• One of the most widely used matrices is PAM250



BLOSUM Matrix (Henikoff & Henikoff, 1992)

• Blocks Amino Acid Substitution Matrices = BLOSUM
• Based on the amino acid substitutions observed in ~2000 conserved blocks of sequences
• These blocks were extracted from a database of 500 protein families
• The amino acid exchanges observed in each column are counted


Example of Computing a BLOSUM Matrix

...A...
...A...
...A...
...A...
...A...
...S...
...A...
...A...
...A...
...A...

• 9 A and 1 S
• 36 A→A pairs ($f_{AA}$) and 9 A→S pairs ($f_{AS}$)
• 210 possible pairs of amino acids
• The frequency of A→A is $q_{AA} = f_{AA}/(f_{AA} + f_{AS}) = 0.8$
• The frequency of A→S is $q_{AS} = f_{AS}/(f_{AA} + f_{AS}) = 0.2$



• The expected frequency with which A takes part in a pair is $p_A = q_{AA} + q_{AS}/2 = 0.9$
• The expected frequency with which S takes part in a pair is $p_S = q_{AS}/2 = 0.1$
• The expected frequency of an AA pair is $e_{AA} = p_A^2 = 0.81$
• The expected frequency of an AS pair is $e_{AS} = 2 p_A p_S = 0.18$



• The value for the pair AA in the matrix is $q_{AA}/e_{AA} = 0.99$ and for AS it is $q_{AS}/e_{AS} = 1.11$
• The values are converted to log-odds scores in half-bit units:
  $s_{AA} = 2 \log_2(q_{AA}/e_{AA}) = -0.04$
  $s_{AS} = 2 \log_2(q_{AS}/e_{AS}) = 0.30$
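The arithmetic of this example can be reproduced from the column itself. A minimal sketch, assuming the column holds nine A and one S as above and that the final scores are twice the base-2 log-odds (half-bit units, the scaling that yields -0.04 and 0.30); the variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

column = "AAAAASAAAA"  # one aligned column: nine A and one S

# Count every unordered pair of residues in the column.
pairs = Counter("".join(sorted(p)) for p in combinations(column, 2))
f_AA, f_AS = pairs["AA"], pairs["AS"]

# Observed pair frequencies.
q_AA = f_AA / (f_AA + f_AS)
q_AS = f_AS / (f_AA + f_AS)

# Expected residue and pair frequencies.
p_A = q_AA + q_AS / 2
p_S = q_AS / 2
e_AA = p_A ** 2
e_AS = 2 * p_A * p_S

# Log-odds scores in half-bit units.
s_AA = 2 * math.log2(q_AA / e_AA)
s_AS = 2 * math.log2(q_AS / e_AS)

print(f_AA, f_AS)
print(round(q_AA, 2), round(q_AS, 2))
print(round(e_AA, 2), round(e_AS, 2))
print(round(s_AA, 2), round(s_AS, 2))
```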


Computing a BLOSUM Matrix

• To balance the oversampling of residues from very similar sequences, sequences more similar than a given threshold (for example 60% identity) are clustered, and the amino acid exchanges within the cluster are averaged. The resulting matrix is called BLOSUM60
• The most widely used matrix is BLOSUM62



The Hazard of Large Databases

• Define
$$p_\varepsilon = P(d(s,t) > \varepsilon \mid U)$$
• This is the probability that two unrelated sequences will match with score > ε by chance
• Assuming that the N sequences in the database are independent of each other, and all are unrelated to s, we have
$$P(\max_t d(s,t) > \varepsilon) = 1 - (1 - p_\varepsilon)^N \approx 1 - e^{-N p_\varepsilon}$$
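A quick numeric check shows how fast chance hits accumulate. The values of p_ε and N below are invented for illustration only, and the function name `p_any_chance_hit` is hypothetical.

```python
import math


def p_any_chance_hit(p_eps, n_sequences):
    """Probability that at least one of n unrelated sequences scores
    above the threshold: exact form and exponential approximation."""
    exact = 1 - (1 - p_eps) ** n_sequences
    approx = 1 - math.exp(-n_sequences * p_eps)
    return exact, approx


# A one-in-a-million chance per sequence, against a million sequences.
exact, approx = p_any_chance_hit(1e-6, 10**6)
print(exact, approx)
```

Even a per-sequence probability of one in a million gives better than even odds of a spurious hit once the database holds a million sequences.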


Local Matching

• Question: Which local alignment query is expected to give a higher score:
  – To a short sequence
  – To a long sequence?
• A local match can begin at any of the nm entries in the DP matrix
• The score is the optimum over all these starting points
• If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials


Score Significance-Fasta

• How meaningful is a score?
• Calculate distribution of scores and related scores
• Under reasonable assumptions the scores for un-gapped alignment behave according to the Extreme Value Distribution


Extreme Value Distribution(BLAST)

• We ask the following question: given a database of size m and a sequence of size n, what is the expected number of hits with score at least S? This number is called an E-score
$$E(S) = K m n e^{-\lambda S}$$
• The number of such hits follows a Poisson distribution with mean E
• K corrects for the dependencies
• λ depends on the scoring matrix
• Doubling the length of the sequence doubles the expectation
• Doubling the score causes E to decrease exponentially


Blast P-value

• Recall the Poisson distribution:
  – The probability of finding no hits with a score >= S is $e^{-E}$
  – Therefore the probability of finding at least one hit with score >= S is $1 - e^{-E}$
  – This is called the P-value
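The two formulas are easy to evaluate together. In this sketch the values of K and λ are placeholders chosen for illustration, not parameters of any real scoring matrix, and the function names are hypothetical.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance hits with score at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given expectation e."""
    return 1 - math.exp(-e)


K, lam = 0.1, 0.3          # placeholder values, assumed for illustration
m = 10**6                  # database size
for n in (200, 400):       # query lengths
    e = e_value(50, m, n, K, lam)
    print(n, e, p_value(e))
```

For small E the P-value is nearly equal to E, which is why the two are often quoted interchangeably for significant hits.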


Protein Families

• Consider Zinc Fingers:
• All have the same function:
  – Bind to DNA
• All have similar structure
• They constitute a Protein Family
• In a protein family some parts of the sequence (the functional parts) are more conserved than others


Multiple Alignment

• Proteins can be classified into families:
  – Common structure
  – Common function
  – Common evolutionary origin
• For a set of sequences belonging to some family
  – Each pair has some differences
  – But, there are some common motifs in almost all sequences of the family
• A multiple alignment carries more information than pairwise alignment


Definition

A multiple alignment of strings S1,S2,…,Sk is a series of strings with blanks S’1,S’2,…,S’k such that:
  – |S’1|=|S’2|=…=|S’k|
  – S’j is an extension of Sj obtained by insertion of blanks


Example

AGT..CTT.ACGCG

AGTAGCTT...GCG

..TAGC.T..GGCG

.CTA.C.TAACCCG

ACTA...TAAC...




Sum of Pairs

• The sum of pairwise distances between all pairs of sequences for some scoring matrix:
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
  where $m_i^k$ is the symbol of sequence k in column $m_i$
• Not only assumes that the alignment of each column is independent, but also each pair of sequences
  – Each sequence is scored as if descended from k-1 sequences instead of one common ancestor
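The sum-of-pairs score of a column, and of a whole multiple alignment, can be sketched as follows. The pair score s used here is an assumption: match +1, mismatch -1, residue against blank -2, and blank against blank 0, with "-" as the blank symbol. The function names are illustrative.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, indel=-2):
    """Assumed pairwise score s(x, y); two blanks score 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch


def sp_column_score(column):
    """Sum of s over all pairs k < l of symbols in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def sp_alignment_score(rows):
    """Sum of the column scores of a multiple alignment."""
    return sum(sp_column_score(col) for col in zip(*rows))


print(sp_column_score("AAT"))
print(sp_alignment_score(["AC", "AC", "AT"]))
```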


Calculation of Multiple Alignment

• The optimal alignment of k sequences of length n can be calculated exactly using k-dimensional dynamic programming
  – Space complexity $O(n^k)$
  – Time complexity $O(2^k n^k)$
• A heuristic program called ClustalW quickly finds a good multiple alignment


The ClustalW Algorithm

• Three steps:
  – 1. Compare all pairs of sequences to obtain a similarity matrix
  – 2. Based on the similarity matrix, make a guide tree relating all the sequences
  – 3. Perform progressive alignment where the order of the alignments is determined by the guide tree


ClustalW - Score of aligning two alignment columns

• Sum the score matrix entry for all pairs of residues
• Weight each pair by the sequences’ weights

1: peeksavtal
2: geekaavlal
3: egewglvlhv
4: aaektkirsa

Score: M(t,v) + M(t,i) + M(l,v) + M(l,i)


ClustalW - Weighting sequences

• Each sequence is given a weight
• Groups of related sequences receive lower weight

1: peeksavtal
2: geekaavlal
3: egewglvlhv
4: aaektkirsa

Weighted score: w1*w3*M(t,v) + w1*w4*M(t,i) + w2*w3*M(l,v) + w2*w4*M(l,i)


ClustalW - Similarity matrix

• Distance between sequences, measured from the guide tree, determines which matrix to use
  – 80-100% seq-id -> use Blosum80
  – 60-80% seq-id -> Blosum60
  – 30-60% seq-id -> Blosum45
  – 0-30% seq-id -> Blosum30


ClustalW - Gap penalties

• Initial gap penalty
  – GOP
• Gap extension penalty
  – GEP

GTEAKLIVLMANE
GA---------KL

Penalty: GOP + 8*GEP


ClustalW - Modifications of gap penalty

• Position specific penalty
  – Gap at position?
    • yes -> lower GOP
    • no, but gap within 8 residues -> increase GOP
  – Hydrophilic residues
    • lower GOP


ClustalW - summary

• Does not use a score for the final alignment
• Each pairwise alignment is done using dynamic programming
• Heuristics (e.g., gap-penalty modifications) are used, tailored to globular proteins
• Graphical version: ClustalX


Creating a PSSM

• After aligning the sequences we see that there are some conserved regions
• We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix
• This matrix represents information from a whole family; it is more strict in highly conserved regions


PSI-BLAST (Position Specific Iterated)

• BLAST provides a new automatic “profile like” search
• Iterative procedure:
  – Perform BLAST on database
  – Use significant alignments to construct a “position specific” score matrix
  – This matrix replaces the query sequence in the next round of database searching
• The program may be iterated until no new significant alignments are found
• Most commonly used search method today
